MongoDB (A NoSQL Database)

Anirban Dey
5 min read · May 11, 2021

MongoDB is a general-purpose, document-based, distributed database built for modern application developers and for the cloud era. It stores data as JSON-like documents with optional schemas.

A few other points:

** A file system is a way to store and manage files and data on a storage device
** A data model defines how the logical structure of a database is organized. It is made up of entities, which are the objects or concepts we want to track data about; in a relational database these become the tables
** SQL databases are relational, use Structured Query Language, have a predefined schema, are typically vertically scalable and table-based, and are better suited to multi-row transactions, while NoSQL databases are non-relational, have dynamic schemas (often described as schemaless), are typically horizontally scalable, come as document, key-value, graph or wide-column stores, and are better suited to unstructured data such as documents or JSON
** MS Excel is spreadsheet software from Microsoft that is used to manage tabular data and to create graphs from it, but it doesn’t offer the querying capabilities that SQL and NoSQL databases provide
** The insert operation adds a record (a row in a table, or a document in a collection) to a database
** Schemaless databases are those where we don’t need to define column names up front (i.e., there is no fixed schema). They are helpful when we have heterogeneous data, which we mainly store as key-value pairs
** A document-oriented database, or document store, is a computer program and data storage system designed for storing, retrieving and managing document-oriented information, also known as semi-structured data. Document-oriented databases are inherently a subclass of the key-value store, another NoSQL database concept
** To set up a MongoDB server, download the software from the official MongoDB website and then run the “mongod” command to start the server
** JSON (JavaScript Object Notation) is a lightweight, key-value based data-interchange format. It is easy for humans to read and write, and easy for machines to generate, parse and process
** CRUD covers the most basic operations MongoDB provides: Create, Read, Update and Delete on our data
** Compass is a GUI tool for MongoDB: we install the package, provide the connection string to connect to our MongoDB server, and then work with our data from there
** To load a dataset into MongoDB, we use the “mongoimport” tool
** To work with MongoDB from Python, we use the “pymongo” library
** Indexes are special data structures that store a small portion of the collection’s data set in an easy to traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field. The ordering of the index entries supports efficient equality matches and range-based query operations.
(Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.)
** A primary key is an attribute that uniquely identifies our documents. In MongoDB, the _id field is the primary key of a collection, so each document can be uniquely identified within that collection. By default, the _id field contains a unique ObjectId value
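
The difference between a collection scan and an index scan described above can be sketched in plain Python. This is only an illustration under stated assumptions, not how MongoDB implements indexes internally: the sample documents, the helper names (collection_scan, build_index, index_scan) and the use of a sorted list with binary search are all hypothetical stand-ins chosen for the example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample collection: each document is a plain dict with an _id.
collection = [
    {"_id": 1, "name": "Asha", "age": 31},
    {"_id": 2, "name": "Ben", "age": 24},
    {"_id": 3, "name": "Chen", "age": 42},
    {"_id": 4, "name": "Dina", "age": 24},
    {"_id": 5, "name": "Eli", "age": 37},
]


def collection_scan(docs, field, low, high):
    """Inspect every document and keep those with low <= doc[field] <= high."""
    examined = 0
    matches = []
    for doc in docs:
        examined += 1
        if low <= doc[field] <= high:
            matches.append(doc["_id"])
    return matches, examined


def build_index(docs, field):
    """Return (value, _id) pairs ordered by value, mimicking an ordered index."""
    return sorted((doc[field], doc["_id"]) for doc in docs)


def index_scan(index, low, high):
    """Binary-search the ordered entries and visit only those in [low, high]."""
    # The key list is rebuilt on each call for brevity; a real index
    # would keep its keys in searchable form permanently.
    keys = [value for value, _ in index]
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    entries = index[start:end]
    return [doc_id for _, doc_id in entries], len(entries)


age_index = build_index(collection, "age")
scan_ids, scan_examined = collection_scan(collection, "age", 30, 40)
index_ids, index_examined = index_scan(age_index, 30, 40)

# Both approaches return the same documents, but examine different amounts.
print(sorted(scan_ids), "examined:", scan_examined)
print(sorted(index_ids), "examined:", index_examined)
```

Both lookups find the same two documents, but the scan touches all five documents while the ordered index touches only the two entries inside the requested range. That is the sense in which an index limits the number of documents that must be inspected.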

** Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high throughput operations
** A replica set in MongoDB is a group of mongod processes that maintain the same data set. Replica sets provide redundancy and high availability, and are the basis for all production deployments
** In a query plan (for example, in the output of explain), the COLLSCAN stage denotes a collection scan, while the IXSCAN stage denotes a scan of index keys
** A compound index is an index that contains references to multiple fields within a document. The order of fields listed in a compound index has significance
** The aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into aggregated results. (The MongoDB aggregation pipeline consists of stages. Each stage transforms the documents as they pass through the pipeline. Pipeline stages do not need to produce one output document for every input document. For example, some stages may generate new documents or filter out documents. The pipeline provides efficient data aggregation using native operations within MongoDB)
** MongoDB mongos instances route queries and write operations to shards in a sharded cluster. mongos provides the only interface to a sharded cluster from the perspective of applications. Applications never connect or communicate directly with the shards. (The mongos tracks what data is on which shard by caching the metadata from the config servers. The mongos uses the metadata to route operations from applications and clients to the mongod instances.)
** MongoDB applications use one of two methods for relating documents:
Manual references, where we save the _id field of one document in another document as a reference. Our application can then run a second query to return the related data. These references are simple and sufficient for most use cases
DBRefs are references from one document to another using the value of the first document’s _id field, collection name, and, optionally, its database name. By including these names, DBRefs allow documents located in multiple collections to be more easily linked with documents from a single collection
** MongoDB Atlas is a fully-managed cloud database developed by the same people that build MongoDB. Atlas handles all the complexity of deploying, managing, and healing our deployments on the cloud service provider of our choice (AWS, Azure and GCP)
** A sharded cluster in MongoDB is a collection of datasets distributed across many shards (servers) in order to achieve horizontal scalability and better performance in read and write operations. (Sharding is very useful for collections that have a lot of data and high query rates.)
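
The idea of routing documents to shards can be sketched with a toy router in plain Python. This is a simplified illustration, not MongoDB's actual mechanism: MongoDB partitions data by ranges of shard key values (or of hashed shard key values) and tracks the placement in config servers, whereas this sketch just takes a hash modulo the shard count. The ToyRouter class, the shard_for helper and the choice of three shards are hypothetical names and values made up for the example.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 3  # hypothetical cluster size chosen for the example


def shard_for(shard_key_value, num_shards=NUM_SHARDS):
    """Map a shard key value to a shard number using a stable hash."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ToyRouter:
    """Plays the routing role: callers never touch the shards directly."""

    def __init__(self, num_shards=NUM_SHARDS):
        self.num_shards = num_shards
        self.shards = defaultdict(dict)  # shard number -> {_id: document}

    def insert_one(self, doc):
        # The shard key here is assumed to be the _id field.
        shard = shard_for(doc["_id"], self.num_shards)
        self.shards[shard][doc["_id"]] = doc
        return shard

    def find_by_id(self, doc_id):
        # Only the shard that owns this key is consulted.
        shard = shard_for(doc_id, self.num_shards)
        return self.shards[shard].get(doc_id)


router = ToyRouter()
for i in range(1, 10):
    router.insert_one({"_id": i, "value": i * i})

total = sum(len(docs) for docs in router.shards.values())
print("documents stored:", total)
print("lookup for _id 4:", router.find_by_id(4))
```

Because the same key always hashes to the same shard, a lookup by shard key only needs to visit one shard, which is how spreading data over several machines can help both reads and writes scale.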
