Monday, April 8, 2019


                                    Data Persistence
Role of data in information systems

At the most basic level, an information system (IS) is a set of components that work together to manage data processing and storage. Its role is to support the key aspects of running an organization, such as communication, record-keeping, decision making, data analysis and more. Companies use this information to improve their business operations, make strategic decisions and gain a competitive edge.
Information systems typically include a combination of software, hardware and telecommunication networks. For example, an organization may use customer relationship management systems to gain a better understanding of its target audience, acquire new customers and retain existing clients. This technology allows companies to gather and analyze sales activity data, define the exact target group of a marketing campaign and measure customer satisfaction.


Need for data persistence

Understanding the meaning of persistence is important for evaluating different data store systems. Given the importance of the data store in most modern applications, making a poorly informed choice could mean substantial downtime or loss of data. In this post, we'll discuss persistence and data store design approaches and provide some background on these in the context of Cassandra.
Persistence is "the continuance of an effect after its cause is removed". In the context of storing data in a computer system, this means that the data survives after the process with which it was created has ended. In other words, for a data store to be considered persistent, it must write to non-volatile storage.
If you need persistence in your data store, then you need to also understand the four main design approaches that a data store can take and how (or if) these designs provide persistence:
·         Pure in-memory, no persistence at all, such as memcached or Scalaris
·         In-memory with periodic snapshots, such as Oracle Coherence or Redis
·         Disk-based with update-in-place writes, such as MySQL ISAM or MongoDB
·         Commitlog-based, such as all traditional OLTP databases (Oracle, SQL Server, etc.)
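
As a rough illustration of the commitlog-based approach in the list above, the sketch below appends every write to a log file before applying it in memory, and replays the log on startup. The class CommitLogStore and its key=value line format are hypothetical and chosen only for this example (keys are assumed to contain no "=" and no line breaks); real databases use far more elaborate log formats, checksums and log compaction.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration; not taken from any real data store.
class CommitLogStore {
    private final Path log;
    private final Map<String, String> memory = new HashMap<>();

    CommitLogStore(Path log) {
        this.log = log;
        replay(); // rebuild in-memory state from whatever survived on disk
    }

    // Append the change to the log first, then apply it in memory.
    void put(String key, String value) {
        String entry = key + "=" + value + System.lineSeparator();
        try {
            // SYNC asks for the bytes to reach the storage device before the call returns.
            Files.write(log, entry.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND, StandardOpenOption.SYNC);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        memory.put(key, value);
    }

    String get(String key) {
        return memory.get(key);
    }

    // Re-apply every logged entry in order; later entries overwrite earlier ones.
    private void replay() {
        if (!Files.exists(log)) {
            return;
        }
        try {
            for (String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    memory.put(line.substring(0, eq), line.substring(eq + 1));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Creating a second CommitLogStore on the same file stands in for a process restart: the new instance has an empty map until replay() reads the log, which is exactly the "data survives the process that created it" property described above.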


Data

In computing, data is information that has been translated into a form that is efficient for movement or processing. Relative to today's computers and transmission media, data is information converted into binary digital form. It is acceptable for data to be used as a singular subject or a plural subject. Raw data is a term used to describe data in its most basic digital format.

Database

A database is an organized collection of data, generally stored and accessed electronically from a computer system. Where databases are more complex, they are often developed using formal design and modeling techniques.

Database Server


The term database server may refer to both hardware and software used to run a database, according to the context. As software, a database server is the back-end portion of a database application, following the traditional client-server model. This back-end portion is sometimes called the instance. It may also refer to the physical computer used to host the database. When mentioned in this context, the database server is typically a dedicated higher-end computer that hosts the database.
Note that the database server is independent of the database architecture. Relational databases, flat files, non-relational databases: all these architectures can be accommodated on database servers.

Database Management System

A database management system (DBMS) is system software for creating and managing databases. The DBMS provides users and programmers with a systematic way to create, retrieve, update and manage data.
A DBMS makes it possible for end users to create, read, update and delete data in a database. The DBMS essentially serves as an interface between the database and end users or application programs, ensuring that data is consistently organized and remains easily accessible.
The DBMS manages three important things: the data, the database engine that allows data to be accessed, locked and modified, and the database schema, which defines the database's logical structure. These three foundational elements help provide concurrency, security, data integrity and uniform administration procedures. Typical database administration tasks supported by the DBMS include change management, performance monitoring/tuning and backup and recovery. Many database management systems are also responsible for automated rollbacks, restarts and recovery as well as the logging and auditing of activity.

Files and Databases

File System

Pros of the File System

  • Performance can be better than when you do it in a database. To justify this: if you store large files in the DB, it may slow down performance, because a simple query to retrieve the list of files or a filename will also load the file data if you used SELECT * in your query. In a file system, accessing a file is quite simple and lightweight.
  • Saving the files and downloading them in the file system is much simpler than it is in a database since a simple "Save As" function will help you out. Downloading can be done by addressing a URL with the location of the saved file.
  • Migrating the data is an easy process. You can just copy and paste the folder to your desired destination while ensuring that write permissions are provided to your destination.
  • It's cost effective in most cases to expand your web server rather than pay for certain databases.
  • It's easy to migrate it to cloud storage (e.g. Amazon S3, CDNs, etc.) in the future.

Cons of the File System

  • Loosely packed. A file system provides no ACID (Atomicity, Consistency, Isolation, Durability) guarantees for the relationship between a file and the database record that refers to it. Consider a scenario in which your files are deleted from the location manually or by an attacker. You might not know whether the file still exists or not. Painful, right?
  • Low security. Since your files can be saved in a folder where you should have provided write permissions, it is prone to safety issues and invites trouble, like hacking. It's best to avoid saving in the file system if you cannot afford to compromise in terms of security.

Database

Pros of Database

  • ACID guarantees, including the rollback of an update, which is complicated when files are stored outside the database.
  • Files will be in sync with the database and cannot be orphaned, which gives you the upper hand in tracking transactions.
  • Backups automatically include file binaries.
  • It's more secure than saving in a file system.

Cons of Database

  • You may have to convert the files to blob in order to store them in the database.
  • Database backups will be more hefty and heavy.
  • Memory use is inefficient. Often, RDBMSs are RAM-driven, so all data has to go to RAM first. Have you ever thought about what happens when an RDBMS has to find and sort data? The RDBMS tracks each data page (even the smallest amount of data read and written), and it has to track whether the page is in memory or on disk, and whether it is indexed or physically sorted.

Different arrangements of data

Data arrangement

• Unstructured
• Semi-structured
• Structured
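
To make the three arrangements a little more concrete, here is a small sketch that holds the same facts about one student in each form. The class DataArrangements, the field names and the sample values are all made up for illustration, and the record syntax assumes Java 16 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical example class; names and values are illustrative only.
class DataArrangements {
    // Structured: a fixed schema known in advance, like a row in a table.
    record Student(int id, String name, String email) {}

    // Unstructured: free text; a program has to parse it to find the facts.
    static String unstructured() {
        return "Ada (student 111) can be reached at ada@example.com.";
    }

    // Semi-structured: self-describing keys, but no fixed schema;
    // another document could omit "contacts" or add new keys.
    static Map<String, Object> semiStructured() {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", 111);
        doc.put("name", "Ada");
        doc.put("contacts", Map.of("email", "ada@example.com"));
        return doc;
    }

    static Student structured() {
        return new Student(111, "Ada", "ada@example.com");
    }
}
```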

Different types of databases


The different types of databases include operational databases, end-user databases, distributed databases, analytical databases, relational databases, hierarchical databases and database models. Databases are classified according to their type of content, application area and technical aspect. For instance, a deductive database combines logic programming with a relational database, while a graph database uses graph structures to represent and store information.
Other types of databases include hypertext databases, mobile databases, parallel databases, active databases, cloud databases, in-memory databases, spatial databases, temporal databases, real-time databases, probabilistic databases and embedded databases.
A database is an organized collection of data. Its primary function is to interact with a database management system to capture and analyze data. A database management system is a software system designed to allow the creation, querying and administration of databases. Some popular database management systems include PostgreSQL, MySQL, Microsoft SQL Server, Oracle, IBM DB2 and SAP.
Databases are designed to handle large amounts of information by inputting, storing, retrieving and managing it. They are set up in a way that allows users to easily and intuitively gain access to all the information. A database management system maintains the integrity and security of stored data. It is also used for data recovery in case of system failure.

Warehouse with Big data

For decades, the enterprise data warehouse (EDW) has been the aspirational analytic system for just about every organization. It has taken many forms throughout the enterprise, but all share the same core concepts of integration/consolidation of data from disparate sources, governing that data to provide reliability and trust, and enabling reporting and analytics. A successful EDW implementation can drastically reduce IT staff bottlenecks and resource requirements, while empowering and streamlining data access for both technical and nontechnical users.
The last few years, however, have been very disruptive to the data management landscape. What we refer to as the “big data” era has introduced new technologies and techniques that provide alternatives to the traditional EDW approach and, in many cases, exceed its capabilities. Many claim we are now in a post-EDW era and that the concept itself is legacy. We position the EDW as a sound concept, but one that needs to evolve.

Database Management System (DBMS) and Its Applications:

A database management system is a computerized record-keeping system. It is a repository or a container for a collection of computerized data files. The overall purpose of a DBMS is to allow the users to define, store, retrieve and update the information contained in the database on demand. Information can be anything that is of significance to an individual or organization.

SQL statements
• Execute standard SQL statements from the application
Statement stmt = con.createStatement();
stmt.executeUpdate("update STUDENT set NAME = '" + name + "' where ID = " + id);
Prepared statements
• The query only needs to be parsed (or prepared) once, but can be executed multiple times with the same or different parameters.
PreparedStatement pstmt = con.prepareStatement("update STUDENT set NAME = ? where ID = ?");
pstmt.setString(1, "MyName");
pstmt.setInt(2, 111);
pstmt.executeUpdate();
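
The "prepared once, executed many times" point can be sketched as follows. The helper class StudentUpdates and its method renameAll are hypothetical names used only for this example; the STUDENT table and its NAME and ID columns come from the snippet above, and the caller is assumed to supply an open java.sql.Connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

// Hypothetical helper class for illustration.
class StudentUpdates {
    // Prepares the statement once, then executes it once per map entry.
    // Returns the total number of rows the database reports as updated.
    static int renameAll(Connection con, Map<Integer, String> namesById) throws SQLException {
        String sql = "update STUDENT set NAME = ? where ID = ?";
        int updated = 0;
        // try-with-resources closes the statement even if an update fails.
        try (PreparedStatement pstmt = con.prepareStatement(sql)) {
            for (Map.Entry<Integer, String> entry : namesById.entrySet()) {
                pstmt.setString(1, entry.getValue()); // bound as data, never spliced into the SQL text
                pstmt.setInt(2, entry.getKey());
                updated += pstmt.executeUpdate();
            }
        }
        return updated;
    }
}
```

Because the values are bound as parameters rather than concatenated into the query string, the same statement text is reused for every row and the values need no manual quoting.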
Callable statements
• Execute stored procedures
CallableStatement cstmt = con.prepareCall("{call anyProcedure(?, ?, ?)}");
cstmt.execute();
OBJECT RELATIONAL MAPPING
There are different structures for holding data at runtime
• Application holds data in objects
• Database uses tables (entities)
• How to map data in objects to the tables?
Object Relational Mapping (ORM)
·         If you’re going to use ORM, you should make your model objects as simple as possible. Be more vigilant about simplicity to make sure your model objects really are just Plain ol’ Data. Otherwise you may end up wrestling with your ORM to make sure the persistence works like you expect it to, and it’s not looking for methods and properties that aren’t actually there.
·         If you’re not going to use ORM, you should probably define DAOs or persistence and query methods to avoid coupling the model layer with the persistence layer. Otherwise you end up with SQL in your model objects and a forced dependency on your project.
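
A minimal sketch of the DAO idea from the second point, assuming a simple Student model with an id and a name. The interface StudentDao and the class InMemoryStudentDao are hypothetical names; the in-memory map only stands in for a real database so the example runs on its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Plain model object: holds data only and knows nothing about storage.
class Student {
    private final int id;
    private final String name;

    Student(int id, String name) {
        this.id = id;
        this.name = name;
    }

    int getId() { return id; }
    String getName() { return name; }
}

// Hypothetical DAO interface: the only persistence API the rest of the code sees.
interface StudentDao {
    void save(Student student);
    Optional<Student> findById(int id);
    boolean delete(int id);
}

// In-memory stand-in used here so the sketch runs without a database.
// A JDBC-backed class could implement the same interface and keep all SQL inside it.
class InMemoryStudentDao implements StudentDao {
    private final Map<Integer, Student> rows = new HashMap<>();

    @Override
    public void save(Student student) {
        rows.put(student.getId(), student);
    }

    @Override
    public Optional<Student> findById(int id) {
        return Optional.ofNullable(rows.get(id));
    }

    @Override
    public boolean delete(int id) {
        return rows.remove(id) != null;
    }
}
```

Code that works with students depends only on StudentDao, so swapping the storage technology means writing another implementation rather than touching the model class.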
Beans use POJO
POJO stands for Plain Old Java Object. It is an ordinary Java object, not bound by any special restriction other than those forced by the Java Language Specification and not requiring any particular classpath. POJOs are used for increasing the readability and re-usability of a program. POJOs have gained wide acceptance because they are easy to write and understand. The term was coined in 2000 by Martin Fowler, Rebecca Parsons and Josh MacKenzie, and the POJO-based programming model was later adopted by EJB 3.0.
A POJO should not:
• Extend pre-specified classes.
• Implement pre-specified interfaces.
• Contain pre-specified annotations.
Beans
• Beans are a special type of POJO. A POJO must satisfy some restrictions to be a bean.
• All JavaBeans are POJOs, but not all POJOs are JavaBeans.
• They should be serializable, i.e. they should implement the Serializable interface. Classes that skip it remain ordinary POJOs; since Serializable is a marker interface with no methods, implementing it adds little burden.
• Fields should be private. This provides complete control over the fields.
• Fields should have getters, setters, or both.
• A bean should have a no-arg constructor.
• Fields are accessed only through the constructor or the getters and setters.
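
Put together, the rules above give a class like the following sketch. StudentBean and its two fields are illustrative names only. The class is left package-private here so the snippet compiles in any file; bean-aware tools generally expect a public class in its own file.

```java
import java.io.Serializable;

// Illustrative JavaBean; the name and fields are made up for this example.
class StudentBean implements Serializable {
    private static final long serialVersionUID = 1L;

    // Private fields: all access goes through the methods below.
    private int id;
    private String name;

    // No-arg constructor, so frameworks can instantiate the bean reflectively.
    public StudentBean() {
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }
}
```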
Java Persistence API (JPA)
• An API/specification for ORM
• Uses
  • POJO classes
  • XML-based mapping file (represents the DB)
  • A provider (an implementation of JPA)
JPA implementations
• Hibernate
• DataNucleus (also implements the separate JDO specification)
• EclipseLink
• ObjectDB
                                ORM TOOLS

Object Relational Mapping (ORM) tools provide a slick way of persisting objects (data) to a database. I personally don't know much about the history of ORM tools, but I will vouch for the power they bring to a project. ORM tools are not for detail-oriented programmers who have to know all the internals of how things work. A good ORM implementation should be black-boxed, so that once you understand what it provides you should not care so much how it works...only that it does. I won't lie...it's a leap of faith that was hard even for me to make.
So what do ORM tools do? Different ORM tools do different things but generally you can expect the following.
1.  ORM tools are made database aware through use of some database abstraction layer...typically outside the scope of the ORM tool (e.g. JDBC, ODBC or, in PHP, PEAR::DB).
2.  ORM tools can produce pure PHP model classes where you generally have one model object per database table.
3.  ORM tools can produce database schemas for the various database management systems supported.
4.  ORM tools greatly reduce the need for developers to write SQL. Why in the world would you want that? Well, developers can spend an inordinate amount of time getting a SQL statement right within their code. The resulting SQL has no guarantee that it will adhere to SQL standards so that it can be used across DBMSs. A good example of such things is Geeklog 1.3.x, where we have REPLACE INTO statements which are specific to MySQL. The other huge benefit is that by reducing the need to write SQL, the developer can concentrate on innovation. How does this work, you ask? Well, in the example of Propel, if you have, let's say, a story object...to save it you simply issue $myStoryObj->save(). Similar methods exist for retrieval and deletion.
5.  ORM tools honor complex relationships. Thus if you have an object that has child objects on it (e.g. a customer who can have many addresses), when you issue a save on the parent, the ORM tool is smart enough to save all objects and will ensure they are wrapped in a transaction.
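
To give a feel for how an ORM tool can spare you from hand-writing SQL (point 4), here is a toy sketch that derives an INSERT statement from a class by reflection. It is not how Propel, Hibernate or any other real tool is implemented; InsertSqlBuilder and its naming convention (class name becomes table name, field name becomes column name, both upper-cased) are assumptions made up for this example.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy mapper for illustration only; real ORM tools are far more involved.
class InsertSqlBuilder {
    // Assumed convention: class name -> table name, field name -> column name.
    static String insertSqlFor(Class<?> type) {
        List<String> columns = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            // Only instance state is persisted; skip static and compiler-generated fields.
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            columns.add(field.getName().toUpperCase(Locale.ROOT));
        }
        // Reflection does not guarantee field order, so sort for a stable result.
        Collections.sort(columns);
        String placeholders = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "insert into " + type.getSimpleName().toUpperCase(Locale.ROOT)
                + " (" + String.join(", ", columns) + ") values (" + placeholders + ")";
    }

    // Example model class with two persistent fields.
    static class Student {
        private int id;
        private String name;
    }
}
```

Calling InsertSqlBuilder.insertSqlFor(InsertSqlBuilder.Student.class) yields a parameterized statement that could be handed to a PreparedStatement; a real tool would also bind the field values, handle relationships and manage the transaction.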

Not Only SQL (NOSQL)

• Relational DBs are good for structured data
• For semi-structured and unstructured data, some other types of DBs can be used
  • Key-value stores
  • Document databases
  • Wide-column stores
  • Graph stores

Benefits of NoSQL

When compared to relational databases, NoSQL databases can be easier to scale horizontally and can perform better for certain workloads, and their data model addresses several issues that the relational model is not designed to address:
•Large volumes of rapidly changing structured, semi-structured, and unstructured data

NoSQL DB servers

• MongoDB
• Cassandra
• Redis
• Amazon DynamoDB
• HBase

Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
• Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop core concepts

• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
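
The MapReduce programming model itself can be sketched in plain Java with the classic word-count example. This does not use the Hadoop API and runs on a single machine; WordCountSketch and its method names are made up for illustration. On a real cluster, the map and reduce steps would run in parallel on different nodes, with the framework doing the shuffle between them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Single-machine illustration of the map, shuffle and reduce phases.
class WordCountSketch {
    // Map phase: turn each input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle phase: collect all values emitted for the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int one : entry.getValue()) {
                sum += one;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        return reduce(shuffle(map(lines)));
    }
}
```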

INFORMATION RETRIEVAL (IR)

• Data in storage should be fetched, converted into information, and presented for proper use
• Information is retrieved via search queries
  • Keyword search
  • Full-text search
• The output can be
  • Text
  • Multimedia
• The information retrieval process should be
  • Fast (good performance)
  • Scalable
  • Efficient
  • Reliable/correct
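
Keyword search with ranking is commonly built on an inverted index, which maps each term to the documents that contain it. The sketch below is a deliberately tiny version that ranks by raw term counts; TinySearchIndex is a hypothetical class and does not reflect how any particular search engine scores or stores documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy inverted index for illustration only.
class TinySearchIndex {
    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    // Returns ids of documents containing any query term, best match first.
    List<Integer> search(String query) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String term : new HashSet<>(tokenize(query))) {
            Map<Integer, Integer> docs = postings.get(term);
            if (docs == null) {
                continue; // term does not occur in any indexed document
            }
            docs.forEach((docId, count) -> scores.merge(docId, count, Integer::sum));
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        // Higher score first; ties are broken by the lower document id.
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }

    // Lower-case the text and split it on anything that is not a letter, digit or underscore.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }
}
```

Lookups touch only the postings of the query terms rather than scanning every document, which is what keeps this kind of search fast as the collection grows.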

Major implementations



• Elasticsearch
• Solr
• Mainly used in search engines and recommendation systems, with ranking
• Additionally may use
  • Natural language processing
  • AI/Machine learning
  • Ranking









