Show Me How: Build Streaming Data Pipelines for Real-Time Data Warehousing | Register Today
A database is an organized collection of information that is stored electronically so information can be maintained, accessed, and analyzed efficiently.
A Database Management System (DBMS) is a software program that is used to manage the database. Additionally, other applications and end-users interact with the database through the DBMS.
The DBMS and the data it manages together make up what's called a database system, often shortened to just “database”.
Database Management System (DBMS) is software that is used by developers to create, modify, retrieve, and maintain the information in the database. Database administrators (DBAs) use DBMS to control users’ access to the database as well as perform auditing functions for security purposes. DBMSs can provide wide range of capabilities, including:
Storing, retrieving, and modifying data
Managing access through Access Control Lists (ACL) and Role-Based Access Control (RBAC)
Safeguarding against data loss by simplifying data backups and snapshots processes along with providing recovery tools to fully or partially restore databases
Continuously monitoring the database to automatically tune performance or alert developers and administrators to take recommended actions
The database as we know it today dates back to the 1960s when use of computers became popular. Here are some of the main milestones in the history of databases
IBM computer scientist Edgar Codd published his paper “A Relational Model of Data for Large Shared Data Banks”. This was the paper that coined the term “relational database” and established a new way to store and access data. Following Codd’s paper, Michael Stonebraker and Eugene Wong at University of California in Berkeley created INGRES (Interactive Graphics and Retrieval System). INGRES was a relational database model that used QUEL query language. IBM released their version of a relational database called System R that used Structured Query Language (SQL) in 1974.
Relational databases grew in popularity and SQL became the standard language to use for databases.
The rise of the internet fueled the next round of growth in the database industry. The Relational Database Management System (RDBMS) model was designed to manage the data of a single organization, and wasn’t prepared to handle the volume of data that web applications were generating. Furthermore, with the decline in performance and increase in maintenance costs, developers look for a new solution and they find MySQL, an open source relational database.
NoSQL (not only structured query language) was initially coined in 1998 and referred to databases that used query languages other than SQL. However, as the internet continued to grow there was a need for a new kind of database that could store unstructured data. And that’s when NoSQL (not only SQL) databases as we know them today emerged. NoSQL databases became popular due to their speed working with large amounts of unstructured data.
In recent years organizations have increasingly been adopting cloud-native and purpose built databases. They are moving away from on-premises and legacy databases to cloud-native databases to improve agility, scalability, and decrease total cost of ownership. Learn more how Confluent can help you with your database modernization journey by visiting our solution page.
SQL is a programming language that is used to communicate with relational databases. The American National Standards Institute (ANSI) has considered SQL the standard language for relational database management systems. SQL statements are used to to add, remove, modify, and query data and they can also be used to grant permissions to users or roles. Popular RDBMS that use SQL are Oracle, Microsoft SQL Server, IBM, MySQL, PostgreSQL, Microsoft Access, Ingres, and more.
There are various types of databases that are designed and built for varying purposes. When choosing a database, it’s important to consider how the data will be used so you choose the best database for your use case.
A relational database is an organized collection of structured data that have a predefined relationship among them. In this database, rows (tuples) and columns (attributes) are used to store data, which together constitute a table. Each row in the database has a primary key which is a unique identifier that distinguishes it from other rows in that table. The primary key of one row could be stored as foreign key in another row in a different table to indicate the relationship between two tables. Relational databases are ideal solution for when the data is structured and has a pre-defined schema.
SQL: SQL is used to store, manipulate, and manage data in a relational database.
Transactions: A database transaction consists of one or more SQL statements and is considered a single unit of work that either is completed as a whole or not at all. In the relational database world, the result of a transaction is either COMMITED to the database or it’s a ROLLBACK.
ACID Compliance: Relational databases are primarily optimized for transaction operations. And in order to ensure data integrity, all transactions must be ACID compliant. ACID refers to Atomicity, Consistency, Isolation, and Durability.
To simplify, NoSQL refers to any database that doesn't use SQL as its primary data access language. NoSQL databases rose in popularity among developers in the late 2000s when the internet was on the rise and storage cost was decreasing significantly. Developers didn’t need to define complex data models. Rather, they had the flexibility to store any structured, semi-structured, or unstructured data. Each kind of NoSQL database has its own set of unique capabilities but overall they can be summarized as having:
Flexible schemas: Unlike their relational counterparts, NoSQL databases don’t require a schema in order to store data. Developers have the flexibility to store huge amounts of data and adapt quickly as application requirements change.
Ease of scalability: NoSQL databases are mostly distributed where several machines (nodes) work together as a single cluster. You can scale up by increasing resources or scale out by adding more nodes to the cluster. They can also replicate data to increase redundancy and improve availability. It’s easier and more cost efficient to scale NoSQL databases vs relational ones.
BASE: NoSQL databases often follow a more relaxed consistency model called BASE where relational database follow a more restricted one called ACID. BASE is the acronym for
Overall, the BASE model favors availability (since scalability is important) over consistency.
There are several types of NoSQL databases but the popular ones are Document based (MongoDB), Key-value (Redis), Columnar (Apache Cassandra), and Graph databases (Neo4J).
Key-value database provides the simplest possible data model as the name suggests. It stores information as a “key-value” pair. The key serves as a unique identifier and is used to retrieve the associated value. Both keys and values can be simple types like strings and numbers or complex objects. Key-value data stores are easy to use, highly scalable, and have low latency. Leaderboards, session stores, shopping carts, and caching are among popular use cases for this type of database. Redis, Amazon DynamoDB, and RocksDB are examples of key-value databases.
Document based databases are designed to store semi-structured data in JSON like objects. Each document has a unique identifier called key that is used to retrieve data from the database. The values can be of any type from strings, numbers to arrays and objects. Unlike key-value stores, document-based databases supports secondary indexes which can be used to retrieve data. Catalogs, mobile applications, and content management systems are among the most popular use cases for document databases. MongoDB and Apache CouchDB are examples of this type of database.
Columnar databases, also known as column-oriented databases, store data in columns rather than rows. Columnar databases optimize data for aggregate functions and operations on columns of data by storing the same type of data in one column. You can easily add more columns to the table without affecting other columns which makes access to data very efficient since querying runs on a subset of columns instead of scanning the entire table. Let’s say we want to quickly calculate average GPSs grouped by grade level. The GPA and Grade columns can be scanned very quickly to compute the averages because the data is stored on disk in a columnar fashion.
Data warehousing, business intelligence (BI), and big data processing are popular use cases for columnar or column-oriented databases. Apache Cassandra is an example of a columnar database.
Graph database (GDB) is a type of NoSQL database that is based on graph theory. A graph database stores entities as Nodes and relationships between them as Edges. To find the relationship between two nodes, you traverse along the edges and since relationships or edges are persisted in the database and not calculated at the query time the traversal is very quick. Graph databases are the perfect solution when you want to create or analyze relationships such as fraud detection, social networking, and recommendation engines. Neo4J is an example of a graph database.
Data warehouse is a central repository for all the data coming from disparate sources. They allow organizations to store historical and current data in a single location for the purpose of creating analytical reports. A data warehouse typically contains databases to store data, ETL tools to prepare data data for analysis, statistical analysis and reporting capabilities, dashboards, and visualizations tools.
A cloud database is a database that is deployed on a public or private cloud. It can be any type of database (relational or NoSQL) and can be offered either as a self-managed or fully-managed service. Cloud databases can help organizations to reduce total cost of ownership, improve reliability and resiliency, and have faster time to market for new use cases.
In a distributed database, data is stored across multiple machines (physical or virtual) that are located in different (physical or logical) locations and connected through a network (intranet or internet). Network latency is often the biggest component in web application response time. This means the farthest end user is from a data center, the longer it takes for the application to respond to their actions. Distributed databases can help improve end user experience since they can be deployed in different geographical locations hence decreasing the network latency.
Relational and NoSQL databases were designed to store the current state of the world in the form of tables, and to answer questions about this state via point-in-time queries against the tables: “What is Alice’s account balance right now?”. In other words, there is no way to determine from a table alone how that state originated—think: “How did we get here?” This is where streams come into play. Event streams model the world around us. By storing individual data records (events) non-destructively as an ordered sequence, they provide a detailed historical view of what happened. Here’s an analogy to playing chess: If you have a recording of all the moves on the chessboard, it is straightforward to know what the current state of the board is. You can also easily rewind the time, using the recorded stream of moves, to know all of the historical states of the board.
To work with data in our systems we need both streams and tables. But so far, technologies have supported one or the other, forcing a wedge between what are two naturally complementary ways of handling data. We could either have state, or we could have events, despite the many benefits of being able to use the semantics of both in one place.
That’s why Confluent developed ksqlDB, the database purpose-built for stream processing applications. ksqlDB offers a single solution for collecting streams of data, enriching them, and serving queries on new derived streams and tables. That means less infrastructure to deploy, maintain, scale, and secure. With less moving parts in your data architecture, you can focus on what really matters -- innovation.
ksqlDB is a database that's purpose-built for stream processing applications. It consolidates the many components found in virtually every stream processing architecture.
That is important because almost all streaming architectures today are piecemeal solutions cobbled together from different projects. At a minimum, you need a subsystem to acquire events from existing data sources, another to store them, another to process them, and another to serve queries against aggregated materializations. Integrating each subsystem can be difficult. Each has its own mental model. And it’s easy to wonder, given all this complexity: Is this all worth it?
With ksqlDB, the complex architecture once required to build stream processing applications is reduced to two components: ksqlDB and Kafka. Rather than needing to work across several complex systems, enterprises can leverage ksqlDB to unlock real-time business insights and rich customer experiences with just a few SQL statements. With fewer moving parts in the underlying architecture, developers can more easily build stream processing applications and get them to market faster, while platform operators can scale, monitor, and secure their architecture with a more uniform approach. Perhaps best of all, ksqlDB and Kafka are offered as a fully managed service on Confluent Cloud, enabling you to entirely offload the management of your stream processing architecture.