Transaction processing engine for flat file access
© 1997 Benoît GILON
v0.2.2, September 29th 1998
To do for the engine development:
- REPEATABLE_READ_2 mode implementation;
- Replication implementation;
- C++ code structure;
- Client/server host environment setup (BSD sockets);
- Adaptation for a more elaborate model of processed objects (GNU DBMS and hash buckets);
- Adaptation for an even more elaborate model (MSQL);
- Compliance to "emerging standards" CORBA/IIOP; COM/DCOM; Galileo ;-) ;
To do for ports to hardware/software platforms:
- Write a RISC OS module ("in process" mode): work completed;
- Write a RISC OS WIMP task ("separate process" mode): work completed in the form of a proxy;
- Write a Windows DLL ("in process" mode);
Table of Contents:
Introduction
Provided with MS/DOS version 5.0 or newer is a module named SHARE.EXE which allows concurrent access to the resources of your local host (mainly files) by the many users working on the same LAN as your station.
Here are the key facts about what SHARE can do:
- Access per "geographical zone" inside your file (the granularity being the byte);
- Two access modes: read/shared and write/exclusive; a write access to a part of the file causes the rejection of every subsequent access to a part of the same file that is not disjoint from the part being updated. This lock lasts until the application which "owns" it releases it.
The main goal of the software described here is to provide enhancements of this mechanism in various ways:
- Transaction mode: elementary write operations to a file can now be considered as part of a whole transaction and then be validated or rejected (rolled back) as a single unit.
- The lock management implemented here is based on a mechanism called "multi-versioning" (already present in the Borland InterBase RDBMS), which ensures that a reading transaction does not block a writing transaction and vice versa. However, there is a case where locking applies: when two transactions try to write to parts of the same file which have some bytes in common (they are not disjoint), the chronologically later write operation fails.
In addition,
- A meaningful aspect of the design is that no assumption is made about the format/structure of the files being accessed. At all times, the files are considered as byte streams and hold validated data only.
- The current release of the module does not implement lock monitoring, so clients never wait for their requests to be serviced for any reason (including lock waiting!). This may evolve in a future release of the engine.
- The design described in this document could easily be adapted to a replication system (based on the submission of validated writes to the "replicate files" as well as to the main "replicated" data file). It could also support more elaborate "data objects" (relational, multidimensional, GIS, etc.).
Preliminary definitions
- Session
- A session is a programmatic entity: a program wishing to retrieve and change the content of a file in a cooperative way (that is the goal of the module) first has to open a session. The session is also one of the engine's internal data structures (more on this later). Even though only one session is needed per client application to access as many file objects as required, the application may open as many sessions as desired.
- Transaction
- A transaction is a set of both reading and writing operations over a set of file objects; these operations are to be considered as a single unit of work. The view of the accessed objects by the application during the transaction's lifetime evolves according to factors such as: the transaction's own write operations, the concurrency model selected for this transaction and the simultaneous activity of concurrent transactions. A transaction begins when the first request after the session was created, or after the previous transaction ceased to exist (through a commit or rollback procedure), is serviced. Usually a transaction is spread over multiple objects and over time (multiple sequential operations over an object); although the commit operation appears atomic to the client application, in practice it may imply several writes to the main data objects in a coordinated manner. The design of secure commit operations over multiple objects leads to mechanisms such as the two-phase commit, which is not currently implemented in the engine (this is due to the prerequisite of keeping the file format/structure unmodified).
- Result set
- Whenever a client application submits a read request to the engine, the engine usually "builds" the data (reads them from the storage device on the "server side") and then possibly converts them into a format suitable for the client application before sending the data over a medium (memory or network) to the client application. The data held in a pending status (waiting to be sent to the client in a synchronised and coordinated manner) are called a result set. Currently the engine supports bi-directional cursors, allowing client applications to browse through the result set in both directions (i.e. forward/backward). Result sets end under the same conditions as the transaction.
- Geographical zone
- When a transaction accesses some part of a data file, some parameters have to be specified to define the part of the file which is being accessed:
- the beginning of the chunk is specified as the offset of the first byte of the chunk from the beginning of the file;
- the end of the chunk is specified as the offset of the byte following the last byte of the chunk;
Whenever two geographical zones referring to the same data file have at least one byte in common (they are not disjoint), we say in this document that they "collide". This term will be used in the discussion of the lock mechanism.
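Given the half-open convention above (zdebut inclusive, zfin exclusive), the collision test reduces to an interval-overlap check. A minimal sketch; the function name is illustrative, and mfpos_t is assumed here to be a long (the document does not define it):

```c
typedef long mfpos_t;   /* assumption: file offset type used by the engine */

/* Two zones of the same file collide when they share at least one byte.
   zdebut is the offset of the first byte, zfin the offset of the byte
   following the last one, so each zone is the half-open interval
   [zdebut, zfin). */
static int zones_collide(mfpos_t zdebut1, mfpos_t zfin1,
                         mfpos_t zdebut2, mfpos_t zfin2)
{
    return zdebut1 < zfin2 && zdebut2 < zfin1;
}
```

Note that two zones merely touching end to end (zfin1 == zdebut2) do not collide, since zfin is exclusive.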
Access modes description
When a transaction begins, a specific parameter determines the concurrency level which will be used:
- DIRTY_READ
- DIRTY_READ is the mode with the lowest prerequisites regarding concurrency with other transactions. It simply says that the client does not mind reading unvalidated data from the objects. So, under this transaction mode, two successive read operations on the same portion of the file may create two entirely different result sets, depending on the activity of concurrent transactions on the file.
- READ_COMMITED
- READ_COMMITED restricts the view of the transaction to: validated changes by any transaction, in addition to unvalidated changes made by the current transaction. One of the main benefits of the "multi-versioning" based engine is that this mechanism avoids locks between such a transaction reading data and a concurrent transaction trying to write to the same data.
- REPEATABLE_READ_1
- REPEATABLE_READ_1 warrants the client that, from the point where a piece of data is read by the transaction, all subsequent readings of that data will retrieve the same "result set", even if a write to this data from a concurrent transaction has been validated since then. Of course, writes to the data by the current transaction will be reflected in the returned "result sets" (as for the other transaction modes).
- REPEATABLE_READ_2
- REPEATABLE_READ_2 warrants that the view of the data by the transaction is constant for all its lifetime (this is like taking a snapshot of the whole content of the file when the transaction begins!).
Write operations are of one of the kinds below:
- update of existing data (referred to as UPDATE in the document);
- append more data to the end of the file (referred to as INSERT in the document);
- deletion of the trailing part of the file (i.e. up to the end of the file) (DELETE in the document).
For every write operation, a lock is set up which remains until the end of the transaction. The following table shows the cases where two transactions may lock each other according to their current activity.

T1/T2  | UPDATE    | INSERT | DELETE
UPDATE | Collision | No     | Collision
INSERT | No        | Yes    | Yes
DELETE | Collision | Yes    | Yes
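The table above can be encoded directly as a lookup. A sketch; the enum and function names are illustrative and are not the engine's actual API:

```c
typedef enum { UPDATE, INSERT, DELETE } twriteop;

/* Outcome of two transactions writing to the same file:
   NO_CONFLICT  - the operations never lock each other,
   ON_COLLISION - they lock each other only if their zones collide,
   ALWAYS       - they always lock each other. */
typedef enum { NO_CONFLICT, ON_COLLISION, ALWAYS } tconflict;

/* Lock compatibility between the write operations of two transactions,
   taken straight from the table above (which is symmetric). */
static tconflict write_conflict(twriteop a, twriteop b)
{
    static const tconflict table[3][3] = {
        /*            UPDATE        INSERT       DELETE       */
        /* UPDATE */ { ON_COLLISION, NO_CONFLICT, ON_COLLISION },
        /* INSERT */ { NO_CONFLICT,  ALWAYS,      ALWAYS       },
        /* DELETE */ { ON_COLLISION, ALWAYS,      ALWAYS       },
    };
    return table[a][b];
}
```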
This table is symmetric, so it doesn't matter from which transaction (T1 or T2) such a write operation comes first (e.g. if T2 submits an UPDATE which involves a part of a file which has already been written on behalf of the active transaction T1, then the UPDATE operation from T2 fails).
Read related operations are of one of the kinds below:
- ask the engine to build a result set (READ or READ_FOR_UPDATE in the document);
- read the whole or part of the result set in the client's memory (FETCH in the document);
- terminate a read operation (ALLDONE in the document).
Engine architecture and data structures
The engine processing model is mono-space/mono-thread. This implies that the primitives callable through the client interface API must be non-blocking. At any time, the engine processes only one client request. I did not introduce in this release the ability to associate a separate thread with each client request (although modern runtime environments do provide multithreading capabilities).
The current release assumes that the "data object class" is a flat file. This file is considered as a stream of ordered bytes without taking into account a more elaborate data structure. Access to the files is being coded with the help of the ANSI C standard library.
When dealing with linked lists, I use part of the Exec library which is provided as part of the Amiga OS kernel (I recoded needed primitives in ANSI C).
The diagram below shows the headlines of the data structures involved:
Session
typedef enum {
DIRTY_READ, READ_COMMITED, REPEATABLE_READ1, REPEATABLE_READ2
} ttmode;
typedef struct
{
struct MinNode entete;
int gsid;
ttmode transmode;
struct MinList ssessionfile;
FILE *hfichlog;
mfpos_t tfichlog;
struct MinList sfreesegs;
} tsession;
entete
: linked node to the global list of sessions
- At the creation of a new session, after memory has been allocated, a call is made to attach the new session node to the tail of the global session list; similarly, a call to remove this node from the list is performed just before deallocating the memory when dropping the session.
gsid
: session identity
- Identifier of the session (serves as an external reference for client applications); its value is simply taken from an autoincrementing static C variable.
ssessionfile
: Head of list of files for this session
-
hfichlog, tfichlog and sfreesegs
: Logfile definition
- During the design phase I made the decision to dedicate a separate log file to each of the active sessions. Logfiles hold pending, unvalidated writes submitted as part of the current transaction, as well as result sets (results of a read operation) when the transaction mode is REPEATABLE_READ_1. hfichlog is the file handle (FILE *), tfichlog is the known size of the file and sfreesegs is an auxiliary structure used to remember the free segments inside the log file. As the file is created/opened with the C library tmpfile() function call, there is no need to remove the file after it has been closed (usually at the end of the session).
transmode
: transaction mode
- transmode is the transaction mode of every transaction created under this session. It is determined at session creation time and never changes until the session is dropped. A future release of the engine may remedy this shortcoming (i.e. allow the client to change the transaction mode between two transactions).
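Putting the field descriptions together, session creation can be sketched as follows. Everything outside the tsession layout shown above is an assumption for illustration: the minimal ANSI C recode of the Exec-style list primitives (NewList/AddTail), mfpos_t defined as long, and the global list and function names.

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal Exec-style doubly linked list, recoded in ANSI C as the
   document describes (names follow the Amiga Exec conventions). */
struct MinNode { struct MinNode *mln_Succ, *mln_Pred; };
struct MinList { struct MinNode *mlh_Head; struct MinNode *mlh_Tail;
                 struct MinNode *mlh_TailPred; };

static void NewList(struct MinList *l)
{
    l->mlh_Head = (struct MinNode *)&l->mlh_Tail;
    l->mlh_Tail = NULL;
    l->mlh_TailPred = (struct MinNode *)&l->mlh_Head;
}

static void AddTail(struct MinList *l, struct MinNode *n)
{
    n->mln_Succ = (struct MinNode *)&l->mlh_Tail;
    n->mln_Pred = l->mlh_TailPred;
    l->mlh_TailPred->mln_Succ = n;
    l->mlh_TailPred = n;
}

typedef enum {
    DIRTY_READ, READ_COMMITED, REPEATABLE_READ1, REPEATABLE_READ2
} ttmode;
typedef long mfpos_t;                    /* assumption: file offset type */

typedef struct {
    struct MinNode entete;
    int gsid;
    ttmode transmode;
    struct MinList ssessionfile;
    FILE *hfichlog;
    mfpos_t tfichlog;
    struct MinList sfreesegs;
} tsession;

static struct MinList global_sessions;   /* global list of sessions */

/* Session creation as described above: allocate, fill in the fields,
   open the per-session log file and attach the node to the tail of
   the global session list. */
tsession *create_session(ttmode mode)
{
    static int next_gsid = 0;            /* autoincrementing identifier */
    tsession *ps = malloc(sizeof *ps);
    if (ps == NULL)
        return NULL;
    ps->gsid = next_gsid++;
    ps->transmode = mode;                /* fixed for the session's lifetime */
    NewList(&ps->ssessionfile);
    ps->hfichlog = tmpfile();            /* removed automatically on close */
    ps->tfichlog = 0;                    /* log file is empty so far */
    NewList(&ps->sfreesegs);
    AddTail(&global_sessions, &ps->entete);
    return ps;
}
```

Dropping the session would do the reverse: remove the node from the global list, fclose() the log file (which deletes it, thanks to tmpfile()) and free the memory.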
Related files in the source tree
session.c
- session management
files.c
- log file management
File
struct Node
{
struct Node *ln_Succ;
struct Node *ln_Prec;
UBYTE ln_Type;
BYTE ln_Pri;
char *ln_Name;
};
typedef struct
{
struct Node fentete;
struct MinList sfilesession;
FILE *handle;
mfpos_t taillfich;
int nbsessions;
tsession *pinsdel;
} tfichier;
fentete
: linked node to global list of files
- When a client application submits a request for accessing a file which has not yet been accessed, a new tfichier structure is created and its node is added to the tail of the files' global list.
fentete.ln_Name
: file identity
- This is the full pathname of the file as passed by the client application. The string passed must refer to an absolute pathname, as it serves as the key of the record.
fentete.ln_Type and fentete.ln_Pri
- These fields are here only for historical reasons (origin of the library in charge of linked list management). They are currently of no use in our context.
sfilesession
: head of list of sessions accessing this file
-
handle and taillfich
: file access data
- The main data file is opened for reading and writing the first time a client application wants to access this file, whether for reading or writing. From this point on and up to the death of the engine, the file remains open. taillfich keeps the size of the file in memory rather than having to compute it each time it is needed.
pinsdel
- Serves optimisation purposes: early detection of a concurrent transaction holding access to this file via an INSERT or DELETE verb.
nbsessions
- Indicates the number of active sessions whose current transaction has initiated an operation over this file.
Related files in the source tree
fichiers.c
- Main data files management
File x Session
typedef struct {
struct MinList sellist;
TYPCID csid;
unsigned int selflags;
mfpos_t zdebut;
mfpos_t zfin;
mfpos_t seldebut;
void *pdata;
} tcursor;
typedef struct {
struct MinNode fentete_session;
struct MinNode fentete_fichier;
tsession *psession;
tfichier *pfichier;
struct MinList szones;
mfpos_t taillfich;
tcursor curseur;
} tsesfich;
fentete_session
: linked node to session's list
-
fentete_fichier
: linked node to file's list
-
psession and pfichier
- Serve optimisation purposes, as we need quick access to the "mother session" (or "father file") structure fields given a pointer to a tsesfich structure.
taillfich
- Holds the view of the data file by the current transaction; it may differ from the actual data file size, as the latter does not take DELETE or INSERT actions into account before they are committed (the taillfich field in the tsesfich structure does).
szones
: head of list of geographical zones for this structure
-
curseur.csid
: result set definition
- Identifier of the cursor; only one cursor may exist per file-session occurrence. As with the session identifier, its value is determined at structure creation time and is taken from an auto-incremented static variable.
curseur.selflags
: result set definition
- Properties of the result set; only bits 0 and 1 are meaningful in this release of the module:
- bit 0 is the activity mask; it is set if a result set is pending (after a client application has submitted a data read request);
- bit 1 is the backward-browse-allowed bit; if it is set, an application can re-read data that has already been read as part of this result set.
curseur.sellist
: result set definition: head of result set's geographical zones list
-
curseur.zdebut and curseur.zfin
: result set definition
- These fields mark the result set within the main data file content; their values are determined just like for a geographical zone: zdebut is the offset of the first byte in the result set and zfin is the offset of the byte following the last byte. Offsets are computed relative to the start of the main data file.
curseur.seldebut
- Current cursor position in the result set; the next data transfer (FETCH call) initiated by the client application will start getting data from this point.
curseur.pdata
: result set definition
- Serves as a temporary data storage pointer, used when all or part of the result set data are not held in the session's logfile and hence may be overwritten by concurrent transactions.
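The taillfich field of tsesfich described above diverges from the real file size as soon as the transaction records its own INSERT (append) and DELETE (truncate) operations. The adjustments can be sketched as follows; the helper names and the mfpos_t definition are assumptions:

```c
typedef long mfpos_t;   /* assumption: file offset type */

/* Maintain the transaction's private view of the file size.
   The real data file keeps its size until the transaction commits;
   only the taillfich field of the tsesfich structure moves. */

/* INSERT: n bytes are appended past the current (viewed) end of file. */
static void view_insert(mfpos_t *taillfich, mfpos_t n)
{
    *taillfich += n;
}

/* DELETE: everything from offset zdebut up to the end of the file is
   removed, so the viewed size becomes zdebut. */
static void view_delete(mfpos_t *taillfich, mfpos_t zdebut)
{
    *taillfich = zdebut;
}
```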
Geographical zone
typedef enum {
READ, READ_FOR_UPDATE, UPDATE,
DELETE, INSERT
} tatype;
typedef long int TGZID;
typedef struct {
struct MinNode zentete;
TGZID gzid;
mfpos_t zdebut;
mfpos_t zfin;
mfpos_t filepos;
tatype atype;
} tzone;
zentete
: linked node to file x session list
-
gzid
: geographical zone identity
- The value of this field, computed at structure creation time, is based on incrementing a static variable.
zdebut and zfin
: geographical zone identity
- Indicate the lower and upper bounds (respectively) of the main data file's data involved (as usual, they are expressed as offsets from the beginning of the main data file).
filepos
- This field indicates where in the logfile the data really are (offset from the beginning of that file). It should not be confused with the zdebut field, as the latter marks the logical position (i.e. in the main data file) where the implied data belong. The table below shows which fields are meaningful under which write-related activity.
Activity (atype) | zdebut     | zfin       | Logfile data (filepos)
UPDATE           | meaningful | meaningful | new data
INSERT           | -          | meaningful | new data
DELETE           | meaningful | -          | -
In addition, the table below shows which fields are meaningful for which read data activity and under which transaction mode:
Transaction mode  | Activity        | zdebut     | zfin       | Logfile data (filepos)
REPEATABLE_READ_1 | READ            | meaningful | meaningful | data read (from the main file)
REPEATABLE_READ_1 | READ_FOR_UPDATE | meaningful | meaningful | data read (from the main file)
READ_COMMITED     | READ            | -          | -          | -
READ_COMMITED     | READ_FOR_UPDATE | meaningful | meaningful | - (only creates a lock)
DIRTY_READ        | READ            | -          | -          | -
DIRTY_READ        | READ_FOR_UPDATE | meaningful | meaningful | - (only creates a lock)
Servicing READ requests in transaction mode DIRTY_READ or READ_COMMITED does not create any geographical zone. READ_FOR_UPDATE does what READ does; in addition, READ_FOR_UPDATE behaves just like an UPDATE regarding lock management, in that it is subject to failure when a concurrent transaction has previously set a lock (UPDATE or READ_FOR_UPDATE).
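The write-activity table above can be applied directly when a zone is created. A sketch using the tzone layout shown earlier; the helper name, the autoincrement placement and the use of 0 for the "-" (not meaningful) fields are assumptions:

```c
typedef enum { READ, READ_FOR_UPDATE, UPDATE, DELETE, INSERT } tatype;
typedef long int TGZID;
typedef long mfpos_t;   /* assumption: file offset type */

struct MinNode { struct MinNode *mln_Succ, *mln_Pred; };

typedef struct {
    struct MinNode zentete;
    TGZID gzid;
    mfpos_t zdebut;
    mfpos_t zfin;
    mfpos_t filepos;
    tatype atype;
} tzone;

/* Fill in a write zone following the "meaningful fields" table:
   UPDATE uses both bounds plus logfile data, INSERT only the upper
   bound plus logfile data, DELETE only the lower bound.  Fields
   marked "-" in the table are zeroed here. */
static void fill_write_zone(tzone *z, tatype op,
                            mfpos_t zdebut, mfpos_t zfin, mfpos_t logpos)
{
    static TGZID next_gzid = 0;               /* autoincrementing id */
    z->gzid = next_gzid++;
    z->atype = op;
    z->zdebut  = (op == INSERT) ? 0 : zdebut; /* not meaningful for INSERT */
    z->zfin    = (op == DELETE) ? 0 : zfin;   /* not meaningful for DELETE */
    z->filepos = (op == DELETE) ? 0 : logpos; /* DELETE stores no data */
}
```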
Result set geographical zone
typedef struct {
struct MinNode zentete;
TGZID gzid;
mfpos_t zdebut;
mfpos_t zfin;
mfpos_t filepos;
tatype atype;
FILE *hfile;
} rtzone;
zentete
: linked node to result set list
-
gzid
- Serves as a placeholder for the gzid field of the tzone structure the data involved come from. A special value (LONG_MIN) means that the data originate from the main data file.
zdebut and zfin
: result set geographical zone identity
- Mark the logical beginning and end of this part of the result set. The overall bounds (min(zdebut) and max(zfin) over the list) are stored in the result set definition zone.
atype
- Serves as a placeholder for the atype field of the tzone structure the data involved come from. When the data come directly from the main data file, the value of this field is simply READ.
hfile and filepos
- These entries describe where the result set data can be retrieved from when servicing a FETCH call.
- Data can be read from the logfile: hfile is the session's logfile handle and filepos is the offset from the beginning of the logfile where the data are stored.
- Data do not exist: hfile is NULL and filepos is meaningless.
- Data can be fetched from memory: hfile is -1 and filepos is the offset from the beginning of the temporary storage area where the data are stored (the base pointer of this area being the pdata field in the result set definition structure).
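A FETCH servicing routine therefore has three cases per result-set zone, keyed on hfile. A minimal sketch, assuming the -1 sentinel is represented as (FILE *)-1 and that the function and macro names are illustrative:

```c
#include <stdio.h>
#include <string.h>

typedef long mfpos_t;             /* assumption: file offset type */

#define HFILE_MEMORY ((FILE *)-1) /* sentinel: data live in pdata */

/* Copy 'len' bytes of one result-set zone into 'dst', according to
   where the data are stored: nowhere, in the temporary memory area
   (base pointer pdata), or in the session's logfile.  Returns the
   number of bytes actually delivered. */
static size_t fetch_zone(FILE *hfile, mfpos_t filepos, void *pdata,
                         void *dst, size_t len)
{
    if (hfile == NULL)             /* data do not exist */
        return 0;
    if (hfile == HFILE_MEMORY) {   /* data are in the temporary area */
        memcpy(dst, (char *)pdata + filepos, len);
        return len;
    }
    /* otherwise the data sit in the session's logfile */
    if (fseek(hfile, (long)filepos, SEEK_SET) != 0)
        return 0;
    return fread(dst, 1, len, hfile);
}
```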
Client library API
The client library is a set of functions which formalize the communication between a client application and the engine.
This simplistic sentence should not hide the complexity buried in it. The transparency of the engine's location and operating mode ("in process" or "separate process") and of how to submit a request to it (which transport layer to use) should be ensured at least partly by the client library (e.g. it should return data to the client application in a format usable by the latter; character set conversion is a relatively easy task to implement; if required, translating variable messages into the natural language of the user running the client application is another matter).
One chapter covers the core functions and the general mechanisms for calling them; another chapter describes specific features which are available only for some hardware/software configurations. When applicable, that chapter refers to sections in other documents describing the API specific to a platform (e.g. Transact PRM for RISC OS, which describes the SWI interface to the Transact module for Acorn computers).
Currently and at this stage of the development:
- the version number of the client library API will reflect that of the engine. However, the version numbers may diverge as more middleware layers become supported.
- the client library is written in a programming language compatible with the language of the client application. Two factors are considered here: the way functions/procedures call other functions/procedures in the language, and the preferred format for representing strings in the language (think of ISO Pascal and ANSI C!). I assume that the representation of scalars does not vary from one language to another on the same hardware platform.
The figure below describes the logical architecture and the possible links between the components to be considered as part of the communication facility of the Transact engine.
- 1 and 1'
- These are the client applications written in a specific language and running on a specific hardware platform;
- 2 and 2'
- These are the client libraries specific to a programming language (they are not hardware-specific; that is, the same library source code can be used on any existing hardware platform);
- 3 and 3'
- These are the libraries specific to a transport layer (they are specific to the runtime hardware too); e.g. they may participate in the character set conversion mechanism.
- 4 and 4'
- These are the components symmetrical to the client-side transport layer libraries, but residing on the server side;
- 5
- The server library dispatcher is responsible for collecting the parameters before calling the many primitives of the server engine and initiating the return of results and messages back to the client application.
The physical model of the Transact engine, which is specific to an implementation, gives details about the operating system processes and the data memory spaces involved.
A model for implementing out-of-process servers will be proposed once I have actually designed and coded one that works. The content of this document will then reflect the results of this task.