Normalized Data
Relational Algebra was primarily developed by Edgar F. Codd from 1969 to 1973, and primarily documented by Chris Date, both IBM employees. Codd also created his “12 rules” (really 13, since he started counting from zero) that were used to define the qualifications of a relational database management system (RDBMS). Codd’s work heavily influenced IBM’s first RDBMS, called System R, back in 1973. System R was created by Ray Boyce and Don Chamberlin.
System R introduced the Structured Query Language (SQL), originally called SEQUEL while in development, which is why we used to refer to SQL Server as “Sequel” Server. Codd and Boyce later teamed up to create the Boyce-Codd Normal Form, which is one step stricter than the 3rd Normal Form.
Why Normalize the Data?
Most developers claim that they do not like normalization because it makes querying data far more complex. In reality, querying is more complex only in the beginning stages of a large project; in the later stages it is far simpler and massively less error prone. The reasons are several:
- Initially, developers match data containers to web pages, which seems inherently logical. The problem arises when the same fields are used on several different web pages, so the data must be maintained in multiple locations, creating a syncing problem. As an example, one Fortune 500 company was maintaining the final sales price in seven separate tables. This affected accounting to the point that the company could never get its figures to reconcile for the SEC or the IRS, eventually leading to it being delisted and audited; the IRS took up residence in its offices for over a year. Many other practical problems follow from this style of data architecture, such as in business intelligence, order fulfillment, tracking, customer support, etc.
- Database systems have a hard time querying NULL fields. NULL fields occur only when the data is not normalized and mean that a column-row intersection has no entry. The results of such queries can be significantly misleading and, again, potentially affect accounting, business intelligence, fulfillment, etc. (a sketch of the problem follows this list). NOTE: Every database that is not normalized inherently has this problem, without exception, by definition.
- The old developer adage that “you need to denormalize for performance” is fundamentally false, as Samuel Berger demonstrated conclusively in his presentation at the IDEAS data science conference held in October 2019 at the L.A. Convention Center (https://www.youtube.com/watch?v=AOnilawQWys). Using the most common data structure in the world, person names, Mr. Berger showed that terabyte-size data is demonstrably faster to query when it is normalized correctly. He loaded roughly 270,000 names into two databases: one normalized, the other using the typical architecture for person-name information. Querying for everyone born on May 5th of any year, the typical denormalized structure returned 729,836 records in 3 minutes 57.4 seconds; the correctly normalized structure returned the same 729,836 records in just 1.5 seconds – 153.9 times faster (a hypothetical sketch of the two designs follows this list). What does that mean to a company with large data requirements? Potentially tens of millions of dollars to accommodate the same amount of data, plus the fact that queries will still be slow when retrieving from non-normalized data containers. So the company will spend more and get far less.
- Only normalizing the data using Relational Algebra enforces relational integrity accurately. Relational integrity prevents orphaned records and duplicates; information is updated in only one place, preventing errors; application changes do not break the data structures and vice versa, so design changes are far easier and less expensive to implement; and large data sets are massively faster to search and retrieve records from.
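To make the NULL and relational-integrity points in this list concrete, here is a minimal sketch in generic SQL. The table and column names (OrderFlat, OrderHeader, OrderCoupon, etc.) are hypothetical and invented purely for illustration: the flattened design forces NULL intersections and three-valued logic into its queries, while the normalized design has no NULLs and uses foreign keys to prevent orphaned records.

    -- Flattened design: optional facts become NULLable columns, so many
    -- column-row intersections simply have no entry.
    CREATE TABLE OrderFlat (
        OrderId      INT PRIMARY KEY,
        CustomerName VARCHAR(100) NOT NULL,
        CouponCode   VARCHAR(20)  NULL      -- NULL whenever no coupon was used
    );

    -- Three-valued logic makes this count misleading: rows where CouponCode
    -- IS NULL are silently excluded, so "orders not using SPRING10" is wrong.
    SELECT COUNT(*) FROM OrderFlat WHERE CouponCode <> 'SPRING10';

    -- Normalized design: the optional fact moves to its own table, there are
    -- no NULL intersections, and the foreign keys prevent orphaned records.
    CREATE TABLE Customer (
        CustomerId   INT PRIMARY KEY,
        CustomerName VARCHAR(100) NOT NULL
    );
    CREATE TABLE OrderHeader (
        OrderId    INT PRIMARY KEY,
        CustomerId INT NOT NULL REFERENCES Customer (CustomerId)
    );
    CREATE TABLE OrderCoupon (
        OrderId    INT PRIMARY KEY REFERENCES OrderHeader (OrderId),
        CouponCode VARCHAR(20) NOT NULL
    );

    -- The same count is now an unambiguous anti-join with no NULL handling.
    SELECT COUNT(*)
    FROM   OrderHeader o
    WHERE  NOT EXISTS (SELECT 1 FROM OrderCoupon c
                       WHERE c.OrderId = o.OrderId AND c.CouponCode = 'SPRING10');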
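As a rough companion illustration of the benchmark described above, here is what a typical flattened person table versus one possible normalized decomposition might look like. This is a hypothetical layout, not the actual schema from Mr. Berger's presentation:

    -- Typical flattened layout: every name part and the birth date sit in one
    -- wide row, with NULLs for the parts a given person does not have.
    CREATE TABLE PersonFlat (
        PersonId   INT PRIMARY KEY,
        FirstName  VARCHAR(50) NOT NULL,
        MiddleName VARCHAR(50) NULL,
        LastName   VARCHAR(50) NOT NULL,
        BirthDate  DATE        NOT NULL
    );

    -- One possible normalized layout: name spellings are stored once and
    -- shared, and the birth date becomes its own relation with no NULLs.
    CREATE TABLE Person (
        PersonId INT PRIMARY KEY
    );
    CREATE TABLE NamePart (
        NamePartId INT PRIMARY KEY,
        Spelling   VARCHAR(50) NOT NULL UNIQUE
    );
    CREATE TABLE PersonName (
        PersonId   INT NOT NULL REFERENCES Person (PersonId),
        NamePartId INT NOT NULL REFERENCES NamePart (NamePartId),
        Position   INT NOT NULL,             -- 1 = first, 2 = middle, 3 = last
        PRIMARY KEY (PersonId, Position)
    );
    CREATE TABLE PersonBirth (
        PersonId   INT PRIMARY KEY REFERENCES Person (PersonId),
        BirthYear  INT NOT NULL,
        BirthMonth INT NOT NULL,
        BirthDay   INT NOT NULL
    );

The “born on May 5th of any year” query against this layout is shown in the Relational Algebra section below.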
Relational Algebra:
Relational Algebra is a form of mathematics specific to data organization (a small worked example appears at the end of this section). Using Relational Algebra, you gain the following advantages:
- Eliminates orphaned records.
- Isolates each data element in only one location in the entire database.
- Requires virtually no indexing for maximum performance. Indexing is a major overhead cost, and periodic re-indexing requires downtime, in some cases for hours. Additionally, the most recent entries are not yet indexed, and those are the entries most commonly queried (e.g., customer service helping a new customer), causing significant delays in loading customer data.
- Math calculations are accurate and dependable.
- Business Intelligence becomes meaningful and profitable.
- Customer orders are not lost and information about an order becomes reliable.
- Eliminates rebuilds of websites or applications normally occasioned by new releases. There is no need to be concerned about how data is displayed and related because the structure is based on the math only. This allows new applications with different ways of displaying and using the data to be created without changing the model and without disturbing its current uses.
All in all, Relational Algebra substantially helps performance, can massively lower hardware requirements, makes data reliable and meaningful, and gives your project longevity and versatility.
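To make “a form of mathematics” concrete, here is a small worked example reusing the hypothetical person tables sketched earlier. The “born on May 5th of any year” query is just a join (⋈), a selection (σ), and a projection (π), and it translates directly into SQL:

    -- Relational algebra:
    --   π PersonId ( σ BirthMonth = 5 ∧ BirthDay = 5 ( Person ⋈ PersonBirth ) )
    -- Join Person to PersonBirth, select the May 5th rows, project the id.
    SELECT p.PersonId
    FROM   Person      p
    JOIN   PersonBirth b ON b.PersonId = p.PersonId
    WHERE  b.BirthMonth = 5
      AND  b.BirthDay   = 5;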
Locking Isolation:
NFSD utilizes advanced locking techniques. With proper locking of the data elements and the ability to disconnect the user immediately upon retrieval of any information, you can easily handle 20 or more times the users with the same amount of hardware and yet never have deadlocks or users timing out. In fact, users can come back the next day, if you wish their login not to expire due to inactivity (over which you have complete control), and continue exactly where they left off (a hypothetical sketch of this idea follows the list below). Contrast this with the common ways session information is handled today:
- Session information is maintained in the URL as long strings of cryptic numbers and letters after the base URL. This is the cheap-and-dirty method, as it is not secure at all. It is also the most common method used in development today.
- Session information is maintained at the server level, causing a separate login each time the user times out or moves from one page to another. This is a high-overhead method and prone to problems of its own, as it is highly complex and requires an additional database.
- Session information is held in session objects. The problem here is that those objects reside in resident memory, and other applications can access that memory at the same time with their own session objects and session information. When two developers use the same naming conventions for a session object, the data will be altered and cause a conflict or an error. This happens commonly enough that this method is seldom used in major applications.
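None of the three methods above keeps session state in the database itself. As a purely hypothetical sketch of the general idea (not a description of NFSD's internals), session state can instead be keyed by an opaque token and stored in an ordinary table, which is what lets a user disconnect and come back the next day exactly where they left off:

    -- Hypothetical session table: the opaque token, not server memory or the
    -- URL query string, is the only thing the client has to present.
    CREATE TABLE UserSession (
        SessionToken CHAR(64)  PRIMARY KEY,  -- e.g. a random 256-bit value, hex encoded
        UserId       INT       NOT NULL,
        LastPageId   INT       NOT NULL,     -- where the user left off
        LastActivity TIMESTAMP NOT NULL      -- type name varies by engine
    );

    -- On each request the token is looked up, the row is updated, and the
    -- connection is released immediately; the expiry policy is just a predicate.
    SELECT UserId, LastPageId
    FROM   UserSession
    WHERE  SessionToken = :token;            -- :token supplied by the application

Because the state lives in a row rather than in server memory, the connection can be released the moment the lookup completes.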