Normalized Data
Relational Algebra was primarily developed by Edgar F. Codd from 1969 to 1973, and primarily documented by Chris Date, both IBM employees. Codd also created his “12 rules” (really 13, since he started counting from zero) that were used to define the qualifications of a relational database management system (RDBMS). Codd’s work heavily influenced IBM’s first RDBMS, called System R, back in 1973. System R was created by Ray Boyce and Don Chamberlin.
System R introduced the Structured Query Language (SQL), originally called SEQUEL while in development, which is why we used to refer to SQL Server as “Sequel” Server. Codd and Boyce later teamed up to create the Boyce-Codd Normal Form, which is one step stricter than the 3rd Normal Form.
Why Normalize the Data?
Most developers claim that they do not like normalization because it makes querying data far more complex. In reality, querying is more complex only in the beginning stages of a large project; in the later stages it is far simpler and massively less error prone. The reasons are several:
- Initially, developers match data containers to web pages, which seems inherently logical. The problem arises when the same fields are used on several different web pages, so the data must be maintained in multiple locations, creating a syncing problem. As an example, one Fortune 500 company was maintaining the final sales price in seven separate tables. This affected accounting to the point that the company could never get its figures to reconcile for the SEC or the IRS, eventually leading to it being delisted and audited; the IRS took up residence in its offices for over a year. Many other practical problems follow from this style of data architecture, such as in business intelligence, order fulfillment, tracking, customer support, etc.
- Database systems have a hard time querying NULL fields. NULL fields occur only when the data is not normalized and mean that a column-row intersection has no entry. The results of such queries can be significantly misleading and, again, potentially affect accounting, business intelligence, fulfillment, etc. (a sketch of the problem follows this list). NOTE: Every database that is not normalized inherently has this problem, without exception, by definition.
- The old developer adage that “you need to denormalize for performance” is fundamentally false, as Samuel Berger demonstrated conclusively in his presentation at the IDEAS data science conference held in October 2019 at the L.A. Convention Center (https://www.youtube.com/watch?v=AOnilawQWys). Using the most common data structure in the world, person names, Mr. Berger showed that terabyte-size data is demonstrably faster to query when it is normalized correctly. He loaded roughly 270,000 names into two databases: one normalized, the other using the typical architecture for person-name information. Querying for everyone born on May 5th of any year, the typical denormalized structure returned 729,836 records in 3 minutes 57.4 seconds; the correctly normalized structure returned the same 729,836 records in just 1.5 seconds – 153.9 times faster (a hypothetical sketch of the two designs follows this list). What does that mean to a company with large data requirements? Potentially tens of millions of dollars to accommodate the same amount of data, plus the fact that queries will still be slow when retrieving from non-normalized data containers. So the company will spend more and get far less.
- Only normalizing the data using Relational Algebra enforces relational integrity accurately. Relational integrity prevents orphaned records and duplicates; information is updated in only one place, preventing errors; application changes do not break the data structures and vice versa, so design changes are far easier and less expensive to implement; and large data sets are massively faster to search and retrieve records from.
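To make the NULL and relational-integrity points in this list concrete, here is a minimal sketch in generic SQL. The table and column names (OrderFlat, OrderHeader, OrderCoupon, etc.) are hypothetical and invented purely for illustration: the flattened design forces NULL intersections and three-valued logic into its queries, while the normalized design has no NULLs and uses foreign keys to prevent orphaned records.

    -- Flattened design: optional facts become NULLable columns, so many
    -- column-row intersections simply have no entry.
    CREATE TABLE OrderFlat (
        OrderId      INT PRIMARY KEY,
        CustomerName VARCHAR(100) NOT NULL,
        CouponCode   VARCHAR(20)  NULL      -- NULL whenever no coupon was used
    );

    -- Three-valued logic makes this count misleading: rows where CouponCode
    -- IS NULL are silently excluded, so "orders not using SPRING10" is wrong.
    SELECT COUNT(*) FROM OrderFlat WHERE CouponCode <> 'SPRING10';

    -- Normalized design: the optional fact moves to its own table, there are
    -- no NULL intersections, and the foreign keys prevent orphaned records.
    CREATE TABLE Customer (
        CustomerId   INT PRIMARY KEY,
        CustomerName VARCHAR(100) NOT NULL
    );
    CREATE TABLE OrderHeader (
        OrderId    INT PRIMARY KEY,
        CustomerId INT NOT NULL REFERENCES Customer (CustomerId)
    );
    CREATE TABLE OrderCoupon (
        OrderId    INT PRIMARY KEY REFERENCES OrderHeader (OrderId),
        CouponCode VARCHAR(20) NOT NULL
    );

    -- The same count is now an unambiguous anti-join with no NULL handling.
    SELECT COUNT(*)
    FROM   OrderHeader o
    WHERE  NOT EXISTS (SELECT 1 FROM OrderCoupon c
                       WHERE c.OrderId = o.OrderId AND c.CouponCode = 'SPRING10');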
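As a rough companion illustration of the benchmark described above, here is what a typical flattened person table versus one possible normalized decomposition might look like. This is a hypothetical layout, not the actual schema from Mr. Berger's presentation:

    -- Typical flattened layout: every name part and the birth date sit in one
    -- wide row, with NULLs for the parts a given person does not have.
    CREATE TABLE PersonFlat (
        PersonId   INT PRIMARY KEY,
        FirstName  VARCHAR(50) NOT NULL,
        MiddleName VARCHAR(50) NULL,
        LastName   VARCHAR(50) NOT NULL,
        BirthDate  DATE        NOT NULL
    );

    -- One possible normalized layout: name spellings are stored once and
    -- shared, and the birth date becomes its own relation with no NULLs.
    CREATE TABLE Person (
        PersonId INT PRIMARY KEY
    );
    CREATE TABLE NamePart (
        NamePartId INT PRIMARY KEY,
        Spelling   VARCHAR(50) NOT NULL UNIQUE
    );
    CREATE TABLE PersonName (
        PersonId   INT NOT NULL REFERENCES Person (PersonId),
        NamePartId INT NOT NULL REFERENCES NamePart (NamePartId),
        Position   INT NOT NULL,             -- 1 = first, 2 = middle, 3 = last
        PRIMARY KEY (PersonId, Position)
    );
    CREATE TABLE PersonBirth (
        PersonId   INT PRIMARY KEY REFERENCES Person (PersonId),
        BirthYear  INT NOT NULL,
        BirthMonth INT NOT NULL,
        BirthDay   INT NOT NULL
    );

The “born on May 5th of any year” query against this layout is shown in the Relational Algebra section below.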
Relational Algebra:
Relational Algebra is a form of mathematics specific to data organization (a small worked example appears at the end of this section). Using Relational Algebra, you gain the following advantages:
- Eliminates orphaned records.
- Isolates each data element in only one location in the entire database.
- Requires virtually no indexing for maximum performance. Indexing is a major overhead cost, and periodic re-indexing requires downtime, in some cases for hours. Additionally, the most recent entries are not yet indexed, and those are the entries most commonly queried (e.g., customer service helping a new customer), causing significant delays in loading customer data.
- Math calculations are accurate and dependable.
- Business Intelligence becomes meaningful and profitable.
- Customer orders are not lost and information about an order becomes reliable.
- Eliminates rebuilds of websites or applications normally occasioned by new releases. There is no need to be concerned about how data is displayed and related because the structure is based on the math only. This allows new applications with different ways of displaying and using the data to be created without changing the model and without disturbing its current uses.
All in all, Relational Algebra substantially helps performance, can massively lower hardware requirements, makes data reliable and meaningful, and gives your project longevity and versatility.
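To make “a form of mathematics” concrete, here is a small worked example reusing the hypothetical person tables sketched earlier. The “born on May 5th of any year” query is just a join (⋈), a selection (σ), and a projection (π), and it translates directly into SQL:

    -- Relational algebra:
    --   π PersonId ( σ BirthMonth = 5 ∧ BirthDay = 5 ( Person ⋈ PersonBirth ) )
    -- Join Person to PersonBirth, select the May 5th rows, project the id.
    SELECT p.PersonId
    FROM   Person      p
    JOIN   PersonBirth b ON b.PersonId = p.PersonId
    WHERE  b.BirthMonth = 5
      AND  b.BirthDay   = 5;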
Locking Isolation:
NFSD utilizes advanced locking techniques. With proper locking of the data elements and the ability to disconnect the user immediately upon retrieval of any information, you can easily handle 20 or more times the users with the same amount of hardware and yet never have deadlocks or users timing out. In fact, users can come back the next day, if you wish their login not to expire due to inactivity (over which you have complete control), and continue exactly where they left off (a hypothetical sketch of this idea follows the list below). Contrast this with the common ways session information is handled today:
- Session information is maintained in the URL as long strings of cryptic numbers and letters after the base URL. This is the cheap-and-dirty method, as it is not secure at all. It is also the most common method used in development today.
- Session information is maintained at the server level, causing a separate login each time the user times out or moves from one page to another. This is a high-overhead method and prone to problems of its own, as it is highly complex and requires an additional database.
- Session information is held in session objects. The problem here is that those objects reside in resident memory, and other applications can access that memory at the same time with their own session objects and session information. When two developers use the same naming conventions for a session object, the data will be altered and cause a conflict or an error. This happens commonly enough that this method is seldom used in major applications.
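None of the three methods above keeps session state in the database itself. As a purely hypothetical sketch of the general idea (not a description of NFSD's internals), session state can instead be keyed by an opaque token and stored in an ordinary table, which is what lets a user disconnect and come back the next day exactly where they left off:

    -- Hypothetical session table: the opaque token, not server memory or the
    -- URL query string, is the only thing the client has to present.
    CREATE TABLE UserSession (
        SessionToken CHAR(64)  PRIMARY KEY,  -- e.g. a random 256-bit value, hex encoded
        UserId       INT       NOT NULL,
        LastPageId   INT       NOT NULL,     -- where the user left off
        LastActivity TIMESTAMP NOT NULL      -- type name varies by engine
    );

    -- On each request the token is looked up, the row is updated, and the
    -- connection is released immediately; the expiry policy is just a predicate.
    SELECT UserId, LastPageId
    FROM   UserSession
    WHERE  SessionToken = :token;            -- :token supplied by the application

Because the state lives in a row rather than in server memory, the connection can be released the moment the lookup completes.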