Data Models and Query Languages #
Most applications are built by layering one data model on top of another. Each layer hides the complexity of the layer below it by providing a clean data model.
Relational model v/s Document model #
SQL based on relational model proposed by Edgar Codd in 1970. Data is organized into relations (tables in SQL) where each relation is an unordered collection of tuples (rows in SQL).
Goal of relational model was to hide the implementation detail behind a cleaner interface.
Other competing alternatives that came and went -> Network model, Hierarchical model, XML DBs etc.
Birth of NoSQL #
The name NoSQL originated as a catch twitter hashtag for an OSS distributed DB meetup. Retroactively reinterpreted as Not Only SQL.
Driving forces behind NoSQL:
- Greater scalability
- Free and OSS preference over commercial DBs
- More dynamic data models
Object Relational Mismatch #
A translation layer is required between application code and the DB model. Difference between these models is called impedance mismatch.
Object Relational Mapping (ORM) frameworks reduce the boilerplate code required for this translation layer, but don’t completely hide the complexity.
Example for options to store a linkedin profile data in DB:
- Relational
- Save the varying info (education, companies, cities, etc) in separate tables with foreign key references to the users table
- Save this info in structured data types like XML or JSON within the same row with querying/indexing support
- Encode as XML or JSON and save in a text field column of same row, but this loses querying etc support, and the app needs to manage/interpret this data
- Document oriented DBs directly store all the information as a self-contained JSON doc. This reduces impedance mismatch but loses schema enforcement. It also provides better locality of all info of a user in one place. Also good for one to many relationships.
Many to one and Many to Many relationships #
Many to one relationships (e.g. many people in seattle) don’t fit nicely into the document model. In relational DBs, joins are easy but join support is weak in document DBs. And so, in such cases, application code might need to do the joins.
Hierarchical Model #
Very similar to JSON DBs. Represents all data as a tree of nested records. It worked well for one to many relationships but not for many to many, and also didn’t support joins.
Network Model #
Proposed by CODASYL to solve issues with hierarchical model, by allowing multiple parents for a record. These links were not through foreign keys though but through pointers, but still stored on disk. A path to access a record from root along these chains of links was called access path.
Due to multiple parents of a record, apps had to keep track of all the relationships and querying/updating the DB became very complex.
Relational Model #
Lays out all the data in the open and all the complexity of choosing the right indexes, ordering, etc is hidden by the query optimizer and app devs don’t have to think about it too much.
Comparison to Document DBs #
Document DBs are similar to hierarchical models in storing nested records for one to many relationships, while for representing many to many relationships, they are similar to relational DBs. In both cases, the related item is referenced by a unique ID called foreign key in relational model, and a document reference in document model. This ID is resolved at read time by joins or follow up queries.
Relational v/s Document DBs today: #
Pros of Document DBs:
- Schema Flexibility
- Better perf due to locality of data
- For some apps, closer to data structures used in code
Pros of Relational DBs:
- Better support for joins and complex queries
In document model, the need for joins can be reduced by:
- Denormalizing the data, but app needs to do extra work to keep the denormalized data consistent
- Making multiple requests to DB, but that moves the complexity to the app code and is also slower
Schema Flexibility in Document model #
Schema-on-read is a more accurate term than calling document DBs schema-less, because the code reading the data applies some implicity schema but it is not enforced by the DB.
Relational DBs have schema-on-write, where the schema is enforced by the DB. This is similar to dynamic v/s static type checking paradigms.
When there is a change in schema, typically in document DBs, the app code starts writing new data with new schema and adds a check to handle the old data differently. While in relational DBs, a migration may need to be performed on old data.
Schema-on-read approach is also useful for heterogenous data where the items in the collection have different schemas.
Data Locality #
Pros:
- Faster if large parts if document are required because don’t need to retrieve data from multiple tables, and it exists at the same place.
Cons:
- Need to load the whole data even if one needs a small portion.
Convergence of Relational and document DBs:
- Most relational DBs these days support XML and JSON.
- Relational DBs also have features to take advantage of locality by allowing to specify some tables’ rows to be interleaved within a parent table (Spanner/Oracle/BigTable etc allow this)
- Many Document DBS support joins, or automatically resolve document references.
Query Languages for Data #
Relational DBS also introduced SQL, a declarative language for querying data. Declarative means that you specify:
- Conditions to be met
- Transformations
- BUT, not how to achieve this (This is done by a query optimizer)
Declarative languages benefits:
- Easier to learn and use since they hide the implementation details of the DB engine.
- DB engine can change things in background for improvements without needing the clients to change.
- Easier to parallelize queries.
CSS on the web is an example of a declarative language.
Map Reduce Querying #
Map Reduce is a programming model for processing large amounts of data in bulk across many machines. A limited form is supported by some NoSQL DBs.
This is somewhere in between declarative and imperative languages, based on repeated calling of map and reduce functions to filter, collect & process.
For DB to run these functions anywhere, in any order and rerun on failure, they need to be pure function (i.e. use only passed-in data and not have any side effects).
Mongo DB has added support for a declarative query called aggregation pipeline which is a sequence of stages that process data records. Similar to a subset of SQL but looks more like JSON.
Graph like Data Models #
Useful when primarily many to many relationships exist in the data. A graph has:
- Vertices (Nodes or entities)
- Edges (Relationships between vertices, or arcs)
Graphs can have vertices of same kind (e.g. web pages on the internet) pr different kinds (E.g. Facebook has people, location, events, etc all as vertices).
Property Graphs #
Each vertex consists of:
- A unique ID
- Set of outgoing edges
- Set of incoming edges
- Set of properties (Key value pairs)
Each edge consists of:
- A unique ID
- Vertex at which edge starts (Tail vertex)
- Vertex at which edge ends (Head vertex)
- Set of properties (Key value pairs)
- Label to describe the relationship between the two vertices
In such aa scheme, it is easy to connect different kinds of data with each other and thus the graph can evolve easily to accommodate changes.
CYPHER was one of the declarative languages used for property graphs in Neo4j.
It is possible to represent this in a relational DB as well with 1 table for vertices and 1 table for edges. Thus SQL can be used to query the graph with the help of recursive common table expressions (WITH RECURSIVE) to express variable length traversal paths that may occur but it takes a lot more effort to do this.
This emphasizes the importance of picking the right data model for the problem at hand.
Triple Store Model and SPARQL #
Similar to property graphs, but all info is stored in form of 3-part statements -> Subject, Predicate, Object (E.g. Jim Likes Bananas)
Subject -> Vertex Object ->
- Can either be a vertex (E.g. B in
A married to B
) - Can be a value in a key-value pair where predicate is the key (E.g. 33 in
A's age 33
)
SPARQL is a query language for triple stores using RDF data model.
A few other discussions follow about Datalog and understanding its basics.