Role of data in information systems
At the most basic level, an information system (IS) is a
set of components that work together to manage data processing and storage. Its
role is to support the key aspects of running an organization, such as
communication, record-keeping, decision making, data analysis and more.
Companies use this information to improve their business operations, make
strategic decisions and gain a competitive edge.
Information systems typically include a combination of
software, hardware and telecommunication networks. For example, an organization
may use customer relationship management systems to gain a better understanding
of its target audience, acquire new customers and retain existing clients. This
technology allows companies to gather and analyze sales activity data, define
the exact target group of a marketing campaign and measure customer
satisfaction.
Need for data persistence
Understanding the meaning of persistence is important for
evaluating different data store systems. Given the importance of the data store
in most modern applications, making a poorly informed choice could mean
substantial downtime or loss of data. In this post, we'll discuss persistence
and data store design approaches and provide some background on these in the
context of Cassandra.
Persistence is "the continuance of an effect after its cause
is removed". In the context of storing data in a computer system, this
means that the data survives after the process with which it was created has
ended. In other words, for a data store to be considered persistent, it must
write to non-volatile storage.
If you need persistence in your data store, then you need to also
understand the four main design approaches that a data store can take and how
(or if) these designs provide persistence:
· Pure in-memory, no persistence at all, such as memcached or Scalaris
· In-memory with periodic snapshots, such as Oracle Coherence or Redis
· Disk-based with update-in-place writes, such as MySQL ISAM or MongoDB
· Commitlog-based, such as all traditional OLTP databases (Oracle, SQL Server, etc.)
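To make the commitlog-based approach more concrete, here is a minimal Java sketch of the idea: every write is appended to a log file before the in-memory map is updated, and the map is rebuilt by replaying the log at startup. The class name CommitLogStore and the key=value line format are assumptions made for illustration. This is not how Cassandra or any particular database implements its commit log, and a real store would also force writes to disk and compact the log.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy store: the log file is the durable copy, the map is a cache.
// Assumes keys contain no '=' and values contain no line breaks.
class CommitLogStore {
    private final Path log;
    private final Map<String, String> data = new HashMap<>();

    CommitLogStore(Path log) {
        this.log = log;
        try {
            if (Files.exists(log)) {
                // Replay the log in order; later entries overwrite earlier ones.
                for (String line : Files.readAllLines(log, StandardCharsets.UTF_8)) {
                    int eq = line.indexOf('=');
                    if (eq > 0) {
                        data.put(line.substring(0, eq), line.substring(eq + 1));
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void put(String key, String value) {
        try {
            // Append to the log first, then update memory.
            Files.write(log, (key + "=" + value + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        data.put(key, value);
    }

    String get(String key) {
        return data.get(key);
    }
}
```

A second CommitLogStore opened on the same file sees the values written by the first, which is the "data survives the process that created it" property described above.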
Data
In computing, data is information that has been
translated into a form that is efficient for movement or processing. Relative
to today's computers and transmission
media, data is information converted into binary digital form. It is
acceptable for data to be used as a singular subject or a plural subject. Raw
data is a term used to describe data in its most basic
digital format.
Database
A database is
an organized collection of data, generally stored and
accessed electronically from a computer system. Where databases are more
complex they are often developed using formal design
and modeling techniques.
Database Server
The
term database server may refer to both hardware and software used to run a
database, according to the context. As software, a database server is the
back-end portion of a database application, following the traditional
client-server model. This back-end portion is sometimes called the instance. It
may also refer to the physical computer used to host the database. When
mentioned in this context, the database server is typically a dedicated
higher-end computer that hosts the database.
Note
that the database server is independent of the database architecture.
Relational databases, flat files, non-relational databases: all these
architectures can be accommodated on database servers.
Database Management System
A
database management system (DBMS) is system software for creating and
managing databases. The DBMS provides users and programmers
with a systematic way to create, retrieve, update and manage data.
A DBMS
makes it possible for end users to create, read, update and delete data in a database. The DBMS essentially
serves as an interface between the database and end users or application programs, ensuring that data is consistently
organized and remains easily accessible.
The
DBMS manages three important things: the data; the database engine, which allows data to be accessed, locked and modified;
and the database schema, which defines the database's logical
structure. These three foundational elements help provide concurrency, security, data integrity and uniform administration
procedures. Typical database administration tasks supported by the DBMS
include change management, performance monitoring/tuning and backup and recovery. Many database management systems are also
responsible for automated rollbacks, restarts and recovery as well as
the logging and auditing of activity.
Files and Databases
File System
Pros of
the File System
- Performance can be better
than when you do it in a database. To
justify this: if you store large files in the DB, it may slow down
performance, because a simple query to retrieve the list of files or
filenames will also load the file data if you used
Select *
in your query. In a file system, accessing a file is quite simple and lightweight.
- Saving the files and
downloading them in the file system is much simpler than it is in a database since a
simple "Save As" function will help you out. Downloading can be
done by addressing a URL with the location of the saved file.
- Migrating the data is an
easy process. You can just copy
and paste the folder to your desired destination while ensuring that write
permissions are provided to your destination.
- It's cost effective in most cases to expand your web
server rather than pay for certain databases.
- It's easy to migrate it to
cloud storage (e.g. Amazon
S3, CDNs, etc.) in the future.
Cons of
the File System
- Loosely coupled. File operations are not covered by the database's ACID
(Atomicity, Consistency, Isolation, Durability) guarantees,
which means nothing keeps the files and the records that point to them in step. Consider a scenario in which your files
are deleted from the location manually or by an attacker. You might
not know whether the file exists or not. Painful, right?
- Low security. Since your files are saved in a
folder that must have write permissions, it is prone to
safety issues and invites trouble, like hacking. It's best to avoid saving
in the file system if you cannot afford to compromise in terms of
security.
Database
Pros of
Database
- ACID consistency, including rollback of an update,
which is complicated when files are stored outside the database.
- Files
will be in sync with the database and
cannot be orphaned, which gives you the upper hand in tracking
transactions.
- Backups
automatically include file binaries.
- It's
more secure than saving
in a file system.
Cons of
Database
- You may have to convert the
files to blob in order to
store them in the database.
- Database backups will be
larger and slower.
- Memory use is inefficient. Often, RDBMSs are RAM-driven, so all
data has to go to RAM first. Yeah, that's right. Have you ever thought
about what happens when an RDBMS has to find and sort data? The RDBMS tracks
each data page, down to the smallest amount of data read and written, and it
has to track whether the page is in memory or on disk, and whether it is indexed or
physically sorted.
Different arrangements of data
Data arrangement
•Un-structured
•Semi-structured
•Structured
The different types
of databases include operational databases, end-user databases, distributed
databases, analytical databases, relational databases, hierarchical databases
and database models. Databases are classified according
to their type of content, application area and technical aspect. For instance,
a deductive database combines logic programming with a relational database,
while a graph database uses graph structures to represent and store
information.
Other types of databases include
hypertext databases, mobile databases, parallel databases, active databases,
cloud databases, in-memory databases, spatial databases, temporal databases,
real-time databases, probabilistic databases and embedded databases.
A database is an organized collection of
data. Its primary function is to interact with a database management system to
capture and analyze data. A database management system is a software system
designed to allow the creation, querying and administration of databases. Some
popular database management systems include PostgreSQL, MySQL, Microsoft SQL
Server, Oracle, IBM DB2 and SAP.
Databases are designed to handle large
amounts of information by inputting, storing, retrieving and managing it. They
are set up in a way that allows users to easily and intuitively gain access to
all the information. A database management system maintains the integrity and security
of stored data. It is also used for data recovery in case of system failure.
For decades, the enterprise
data warehouse (EDW) has been the aspirational analytic system for just about
every organization. It has taken many forms throughout the enterprise, but all
share the same core concepts of integration/consolidation of data from
disparate sources, governing that data to provide reliability and trust, and
enabling reporting and analytics. A successful EDW implementation can
drastically reduce IT staff bottlenecks and resource requirements, while
empowering and streamlining data access for both technical and nontechnical
users.
The last few years,
however, have been very disruptive to the data management landscape. What we
refer to as the “big data” era has introduced new technologies and techniques
that provide alternatives to the traditional EDW approach and, in many cases,
exceed its capabilities. Many claim we are now in a post-EDW era and the
concept itself is legacy. We see the EDW as a sound concept, but one
that needs to evolve.
Database Management System (DBMS) and Its Applications:
A database management system is a computerized
record-keeping system. It is a repository or a container for a collection of
computerized data files. The overall purpose of a DBMS is to allow the users to
define, store, retrieve and update the information contained in the database on
demand. Information can be anything that is of significance to an individual or
organization.
SQL statements
• Execute standard SQL statements from the application
Statement stmt = con.createStatement();
stmt.executeUpdate("update STUDENT set NAME = '" + name + "' where ID = " + id);
Prepared statements
• The query only needs to be parsed (or prepared) once, but can be executed multiple times with the same or different parameters.
PreparedStatement pstmt = con.prepareStatement("update STUDENT set NAME = ? where ID = ?");
pstmt.setString(1, "MyName");
pstmt.setInt(2, 111);
pstmt.executeUpdate();
Callable statements
• Execute stored procedures
CallableStatement cstmt = con.prepareCall("{call anyProcedure(?, ?, ?)}");
// bind each ? (for example with cstmt.setString or cstmt.setInt) before executing
cstmt.execute();
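One practical difference between a concatenated Statement and a PreparedStatement is what happens when a value contains SQL syntax. The following Java sketch only builds the SQL text and sends nothing to a database. The class SqlConcatSketch and its method are hypothetical names used for illustration, and the hostile input is made up.

```java
// Hypothetical helper: shows the text a concatenated statement would send.
class SqlConcatSketch {
    // Mirrors the string concatenation used with Statement above.
    static String buildUpdate(String name, int id) {
        return "update STUDENT set NAME = '" + name + "' where ID = " + id;
    }
}
```

Calling buildUpdate("x' where 1=1 --", 1) yields a statement whose own where clause has been pushed behind a SQL comment, so the input has changed the structure of the query. With a PreparedStatement the same string is bound to the ? placeholder as a plain value instead of being spliced into the SQL text.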
OBJECT RELATIONAL MAPPING
There are different structures for holding data at runtime
• Application holds data in objects
• Database uses tables (entities)
• How to map data in objects to the tables?
Object Relational Mapping (ORM)
· If you're going to use ORM, you should make your model objects as simple as possible. Be more vigilant about simplicity to make sure your model objects really are just Plain ol' Data. Otherwise you may end up wrestling with your ORM to make sure the persistence works like you expect it to, and it's not looking for methods and properties that aren't actually there.
· If you're not going to use ORM, you should probably define DAOs or persistence and query methods to avoid coupling the model layer with the persistence layer. Otherwise you end up with SQL in your model objects and a forced dependency on your project.
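As a rough illustration of the DAO advice, the Java sketch below keeps the model object free of SQL and puts persistence behind an interface. All names here (Student, StudentDao, InMemoryStudentDao) are hypothetical, and the in-memory map is only a stand-in for a real database-backed implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Plain model object: no SQL and no persistence code inside it.
class Student {
    final int id;
    final String name;

    Student(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Hypothetical DAO contract; the model layer depends only on this interface.
interface StudentDao {
    void save(Student student);
    Optional<Student> findById(int id);
    boolean delete(int id);
}

// In-memory stand-in; a JDBC-backed class could implement the same interface.
class InMemoryStudentDao implements StudentDao {
    private final Map<Integer, Student> rows = new HashMap<>();

    public void save(Student student) {
        rows.put(student.id, student);
    }

    public Optional<Student> findById(int id) {
        return Optional.ofNullable(rows.get(id));
    }

    public boolean delete(int id) {
        return rows.remove(id) != null;
    }
}
```

Code that works with students only sees StudentDao, so swapping the storage technology should not require changes to the model classes.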
Beans use POJO
POJO
stands for Plain Old Java Object. It is an ordinary Java object, not bound by
any special restriction other than those forced by the Java Language Specification
and not requiring any classpath. POJOs are used for increasing the readability
and re-usability of a program. POJOs have gained wide acceptance because they
are easy to write and understand. They came into mainstream use when Sun
Microsystems' EJB 3.0 adopted a POJO-based programming model.
A POJO should not:
• Extend pre-specified classes.
• Implement pre-specified interfaces.
• Contain pre-specified annotations.
Beans
• Beans are a special type of POJO. There are some restrictions a POJO must meet to be a bean.
• All JavaBeans are POJOs but not all POJOs are JavaBeans.
• They should be Serializable, i.e. implement the Serializable interface. Serializable is a marker interface with no methods, so implementing it is not much of a burden; a class that skips it is still a POJO.
• Fields should be private. This gives the class complete control over its fields.
• Fields should have getters or setters or both.
• A bean should have a no-arg constructor.
• Fields are accessed only through the constructor or the getters and setters.
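The bean rules listed above can be seen together in a small Java class. StudentBean and its fields are made-up names for this sketch; in a real project the class would normally be declared public in its own file.

```java
import java.io.Serializable;

// Illustrative bean: Serializable, private fields, no-arg constructor, getters and setters.
class StudentBean implements Serializable {
    private static final long serialVersionUID = 1L;

    private int id;       // fields are private
    private String name;

    public StudentBean() {
        // no-arg constructor, so tools can create an empty instance and fill it in
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }
}
```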
Java Persistence API (JPA)
• An API/specification for ORM
• Uses
  • POJO classes
  • XML based mapping file (represents the DB)
  • A provider (implementation of JPA)
    • Hibernate
    • JDO
    • EclipseLink
    • ObjectDB
Object Relational
Mapping (ORM) tools provide a slick way of persisting objects (data) to a
database. I personally don't know much about the history of ORM tools but I
will vouch for the power they bring to a project. ORM tools are not for
detail-oriented programmers who have to know all the internals of how things
work. A good ORM implementation should be black-boxed so that once you
understand what it provides you with, you should not care so much how it
works, only that it does. I won't lie: it's a leap of faith that was even
hard for me to make.
So what do ORM tools do?
Different ORM tools do different things but generally you can expect the
following.
1. ORM tools are made database aware through the use of some database abstraction layer, typically outside the scope of the ORM tool (e.g. JDBC, ODBC or, in PHP, PEAR::DB).
2. ORM tools can produce pure PHP model classes where you generally have one model object per database table.
3. ORM tools can produce database schemas for the various database management systems supported.
4. ORM tools greatly reduce the need for developers to write SQL. Why in the world would you want that? Well, developers can spend an inordinate amount of time getting a SQL statement right within their code. The resulting SQL has no guarantee that it will adhere to SQL standards so that it can be used across DBMSs. A good example is Geeklog 1.3.x, where we have REPLACE INTO statements, which are specific to MySQL. The other huge benefit is that by reducing the need to write SQL, the developer can concentrate on innovation. How does this work, you ask? Well, in the example of Propel, if you have, let's say, a story object, to save it you simply issue $myStoryObj->save(). Similar methods exist for retrieval and deletion.
5. ORM tools honor complex relationships. Thus if you have an object that has child objects on it (e.g. a customer who can have many addresses), when you issue a save on the parent, the ORM tool is smart enough to save all objects and will ensure they are wrapped in a transaction.
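To give a feel for the "one model object per table" idea, here is a toy Java mapper that uses reflection to turn an object's fields into a parameterized INSERT statement. It is only a sketch: TinyMapper and Story are invented names, the convention "table name equals lower-cased class name" is an assumption, and this is not the API of Propel or of any real ORM tool.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Locale;
import java.util.StringJoiner;
import java.util.TreeMap;

// Model object; by this sketch's convention it maps to a table named "story".
class Story {
    private final int id;
    private final String title;

    Story(int id, String title) {
        this.id = id;
        this.title = title;
    }
}

// Hypothetical toy mapper, not a real ORM.
class TinyMapper {
    // Reads every instance field of the object: column name -> value to bind.
    static TreeMap<String, Object> columns(Object model) {
        TreeMap<String, Object> result = new TreeMap<>(); // sorted for a stable column order
        for (Field field : model.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            field.setAccessible(true);
            try {
                result.put(field.getName(), field.get(model));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return result;
    }

    // Builds the SQL text; the values would be bound through a PreparedStatement.
    static String insertSql(Object model) {
        StringJoiner names = new StringJoiner(", ");
        StringJoiner marks = new StringJoiner(", ");
        for (String column : columns(model).keySet()) {
            names.add(column);
            marks.add("?");
        }
        String table = model.getClass().getSimpleName().toLowerCase(Locale.ROOT);
        return "insert into " + table + " (" + names + ") values (" + marks + ")";
    }
}
```

A real ORM tool adds much more on top of this, such as type conversion, relationships and transactions, but the core step of deriving SQL from the object's shape is the same in spirit.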
Not Only SQL (NOSQL)
• Relational DBs are good for structured data
• For semi-structured and un-structured data, some other types of DBs can be used
  • Key-value stores
  • Document databases
  • Wide-column stores
  • Graph stores
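What "semi-structured" means for a document database can be sketched in a few lines of Java. DocumentCollection below is a hypothetical in-memory stand-in, not the API of MongoDB or any other product. The point it illustrates is that each document is a set of fields and two documents in the same collection need not share the same fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a document database collection.
class DocumentCollection {
    private final List<Map<String, Object>> docs = new ArrayList<>();

    // Documents are schemaless: each map may carry different fields.
    void insert(Map<String, Object> doc) {
        docs.add(new HashMap<>(doc));
    }

    // Returns every document whose field equals the given value.
    List<Map<String, Object>> find(String field, Object value) {
        List<Map<String, Object>> hits = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            if (value.equals(doc.get(field))) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

A relational table would need a fixed set of columns up front, whereas here a document with a "city" field and another with a "tags" list can sit side by side.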
Benefits of NoSQL
• When compared to relational databases, NoSQL databases
are often easier to scale out and can perform better for some workloads, and their data model
addresses several issues that the relational model is not designed to address:
  • Large volumes of rapidly changing structured,
semi-structured, and unstructured data
NoSQL DB servers
•MongoDB
•Cassandra
•Redis
•Amazon DynamoDB
•HBase
Hadoop
• The Apache Hadoop software library is a framework that
allows for the distributed processing of large data sets across clusters of
computers using simple programming models.
•It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage.
• Rather than rely on hardware to deliver high-availability,
the library itself is designed to detect and handle failures at the application
layer, so delivering a highly-available service on top of a cluster of
computers, each of which may be prone to failures.
Hadoop core concepts
• Hadoop Distributed File System (HDFS): A distributed
file system that provides high-throughput access to application data
• Hadoop YARN: A framework for job scheduling and
cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
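The map and reduce phases are easier to picture with the classic word-count example. The Java sketch below runs in a single JVM and deliberately does not use the Hadoop API; WordCountSketch and its methods are hypothetical names. In a real cluster the map calls would run in parallel on different nodes and the framework would shuffle the pairs to the reducers.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Single-JVM illustration of the map and reduce phases; not the Hadoop API.
class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in one line of input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and reduce: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Driver: here the lines are processed one after another.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(all);
    }
}
```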
INFORMATION RETRIEVAL (IR)
Data in storage should be fetched, converted into information, and presented for proper use
• Information is retrieved via search queries
  • Keyword search
  • Full-text search
• The output can be
  • Text
  • Multimedia
The information retrieval process should be
• Fast/performant
• Scalable
• Efficient
• Reliable/Correct
Major implementations
• Elasticsearch
• Solr
• Mainly used in search engines and recommendation systems, with ranking
• Additionally may use
  • Natural language processing
  • AI/Machine learning
  • Ranking
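Keyword search with ranking can be illustrated with a tiny inverted index in Java. This is a toy under stated assumptions: InvertedIndexSketch is a hypothetical class, terms are split on non-word characters, and documents are ranked only by how often the keyword occurs. Engines such as Elasticsearch and Solr use far more elaborate analysis and scoring.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy index: term -> (document id -> occurrences of the term).
class InvertedIndexSketch {
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    // Index a document by counting each lower-cased term it contains.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
    }

    // Keyword search: matching document ids, highest term frequency first,
    // ties broken by the smaller document id.
    List<Integer> search(String keyword) {
        Map<Integer, Integer> hits =
                postings.getOrDefault(keyword.toLowerCase(Locale.ROOT), new HashMap<>());
        List<Integer> ids = new ArrayList<>(hits.keySet());
        ids.sort((a, b) -> {
            int byCount = Integer.compare(hits.get(b), hits.get(a));
            return byCount != 0 ? byCount : Integer.compare(a, b);
        });
        return ids;
    }
}
```

Looking a term up in the index avoids scanning every document at query time, which is one reason this structure is common in search engines.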