Friday, 19 December 2014

Content Processing Framework



Content Processing Framework (CPF) in MarkLogic

The Content Processing Framework (CPF) is an important MarkLogic concept. It is most commonly used to tailor and edit content according to a set of rules: it passes a document through a sequence of steps that convert the data into the desired, usable format.

Today we are going to discuss the key points of CPF that are important to know when implementing a CPF-based application.

Overview of the Content Processing Framework

A content processing application often consists of multiple steps, and each step performs a specific task or set of tasks. Each step moves the document's state closer to the final state, and each step is committed to the database individually.

Components of CPF

CPF is built on the following mandatory components, which together process data to make it more usable and effective.

1. Domains
2. Pipelines
3. Modules and XQuery functions for content processing

1. Domains
A domain defines the scope, that is, the set of documents that CPF processes. Domains group documents for a specific type of processing; the same document can also belong to another domain for a different kind of processing.
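
As a rough sketch of what defining a domain can look like in code, the snippet below uses dom:create from MarkLogic's CPF domains library. The domain name "Book Documents" and the /books/ directory are hypothetical, and the sketch assumes that CPF is already installed, that the modules live in the Modules database, and that the query is run against the triggers database of the content database.

```xquery
xquery version "1.0-ml";

import module namespace dom = "http://marklogic.com/cpf/domains"
  at "/MarkLogic/cpf/domains.xqy";

(: Hypothetical domain covering every document under /books/, at any depth. :)
dom:create(
  "Book Documents",
  "All documents under the /books/ directory",
  dom:domain-scope("directory", "/books/", "infinity"),
  (: Modules are assumed to live in the Modules database under the root "/". :)
  dom:evaluation-context(xdmp:database("Modules"), "/"),
  (),  (: no pipelines attached yet :)
  ()   (: no extra permissions :)
)
```

The same domain can also be created from the Admin Interface; the code form is mainly useful for scripted setups.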

2. Pipelines
In simple words, a pipeline is a tunnel made up of various steps. Each step modifies the document's content and state and moves it towards the final stage, where the document can be used directly.

A pipeline is an XML document written in a specific format, with specific elements that define the steps for processing a document through CPF. A sample pipeline XML is given below, followed by the key points it defines.

<?xml-stylesheet href="/cpf/pipelines.css" type="text/css"?>
<pipeline xmlns="http://marklogic.com/cpf/pipelines" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://marklogic.com/cpf/pipelines pipelines.xsd">
  <pipeline-name>Add-last-date</pipeline-name>
  <pipeline-description>
    This pipeline adds a new last-modified property to the document.
  </pipeline-description>
  <success-action>
    <module>/MarkLogic/cpf/actions/success-action.xqy</module>
  </success-action>
  <failure-action>
    <module>/MarkLogic/cpf/actions/failure-action.xqy</module>
  </failure-action>
  <state-transition>
    <annotation>
      When a document is inserted or updated, add a new last-modified property to the document.
    </annotation>
    <state>http://marklogic.com/states/initial</state>
    <on-success>http://marklogic.com/states/converted</on-success>
    <on-failure>http://marklogic.com/states/error</on-failure>
    <priority>9200</priority>
    <execute>
      <condition>
        <module>/MarkLogic/cpf/actions/mimetype-condition.xqy</module>
        <options xmlns="/MarkLogic/cpf/actions/mimetype-condition.xqy">
          <mime-type>xml</mime-type>
        </options>
      </condition>
      <action>
        <module>/MarkLogic/conversion/actions/convert-html-action.xqy</module>
        <options
         xmlns="/MarkLogic/conversion/actions/convert-html-action.xqy">
          <destination-root/>
          <destination-collection/>
        </options>
      </action>
    </execute>
  </state-transition>
  <!-- States converted and error not handled here -->
</pipeline>


1. Steps :- A pipeline contains all the steps (each indicated by a 'state-transition' tag in the XML) to run when a document from a specific domain is processed through CPF.
2. States :- Each step names the document state (indicated by the 'state' tag) in which it applies. A single document can be processed differently depending on its state. For example, one process can run when the document is in the insert/initial state and a different one when it is in the updated state.
3. Success Action :- The pipeline XML names the success action (indicated by the 'success-action' tag) to perform when processing succeeds. It is shown as "On Success" in a pipeline that is already configured in CPF.
4. Failure Action :- The pipeline XML names the failure action (indicated by the 'failure-action' tag) to perform when processing fails. It is shown as "On Failure" in a pipeline that is already configured in CPF.
5. State of the document after a step succeeds :- The 'on-success' tag gives the state the document moves to after a step completes successfully (the 'on-failure' tag does the same for a failed step). As discussed earlier, each step is executed separately and its result is committed after the step, so this setting tells you the state that is committed once the step has run.
6. Execute :- This section holds the condition that must be satisfied for a document to be processed and the action to perform when the condition is satisfied.
7. Condition :- This section gives the location of a module that returns a boolean result. The document being processed must satisfy the condition; only then is it passed to the action module. For example, if an action should run only on XML files, a condition module that checks the document's MIME type is used.
8. Action :- This section gives the location of the module written to process the document when the associated condition matches.
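
Once the pipeline XML is written, it still has to be loaded and attached to a domain. The sketch below shows one way this might be done with p:insert, p:get and dom:add-pipeline from the CPF libraries. The file path /tmp/add-last-date.xml and the domain name "Book Documents" are hypothetical, and the code assumes it runs against the triggers database where the domain already exists.

```xquery
xquery version "1.0-ml";

import module namespace p = "http://marklogic.com/cpf/pipelines"
  at "/MarkLogic/cpf/pipelines.xqy";

(: Load the pipeline XML from the file system (hypothetical path). :)
p:insert(xdmp:document-get("/tmp/add-last-date.xml")/p:pipeline)
;

xquery version "1.0-ml";

import module namespace p = "http://marklogic.com/cpf/pipelines"
  at "/MarkLogic/cpf/pipelines.xqy";
import module namespace dom = "http://marklogic.com/cpf/domains"
  at "/MarkLogic/cpf/domains.xqy";

(: Second statement: look the pipeline up by name and attach it to the
   hypothetical "Book Documents" domain. :)
dom:add-pipeline("Book Documents", p:get("Add-last-date")/p:pipeline-id)
```

The two statements are separated by a semicolon so that the pipeline is committed before it is looked up by name.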



Below is sample code for an action module that adds a last-updated element to XML documents (under the book root element).

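xquery version "1.0-ml";

import module namespace cpf = "http://marklogic.com/cpf"
  at "/MarkLogic/cpf/cpf.xqy";

(: Both variables are supplied by CPF when it invokes the action. :)
declare variable $cpf:document-uri as xs:string external;
declare variable $cpf:transition as node() external;

(: Run only if the document is still in the state this transition expects. :)
if (cpf:check-transition($cpf:document-uri, $cpf:transition)) then
  try {
    let $doc := fn:doc($cpf:document-uri)
    return
      (: Append <last-updated> as the last child of the root <book> element. :)
      xdmp:node-insert-child(
        $doc/book,
        <last-updated>{fn:current-dateTime()}</last-updated>
      ),
    xdmp:log("add last-updated ran OK"),
    (: Move the document to the on-success state. :)
    cpf:success($cpf:document-uri, $cpf:transition, ())
  } catch ($e) {
    (: Move the document to the on-failure state and record the error. :)
    cpf:failure($cpf:document-uri, $cpf:transition, $e, ())
  }
else ()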
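
The action module above only makes sense for documents with a book root element, so a custom condition module could guard it. The following is a minimal sketch of such a condition, not one of the modules shipped with MarkLogic; it assumes CPF supplies $cpf:document-uri as an external variable, as it does for the action module.

```xquery
xquery version "1.0-ml";

declare namespace cpf = "http://marklogic.com/cpf";

(: Supplied by CPF when it evaluates the condition. :)
declare variable $cpf:document-uri as xs:string external;

(: True only when the document has a <book> root element. :)
fn:exists(fn:doc($cpf:document-uri)/book)
```

To use it, the module would be stored in the modules database and its path placed in the 'module' tag inside 'condition' in the pipeline XML.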

This is enough to get started with the basic concepts of CPF and the most important points to know. Beyond this, CPF has plenty of advanced functionality that you can explore further. I will also try to share some simple examples with you.

Till then, keep exploring.


Wednesday, 4 June 2014

Introduction to MarkLogic



MarkLogic is a new-generation database for Big Data. It provides a trusted platform for Big Data applications: it helps you search for crucial information in large volumes of disconnected, non-relational data and quickly turn it into useful data that can generate revenue. MarkLogic is a NoSQL database.

MarkLogic stores its content as XML, so it helps to have some knowledge of XML and related technologies such as XPath, XQuery and XSLT. MarkLogic is a very powerful tool for searching large amounts of content and processing it quickly to get meaningful results for analysis and decision making.

The best way to understand MarkLogic is to download the developer version and start playing with it. Before that, though, you need to know some basic theory about the MarkLogic database.


So today we are going to discuss, at a very basic level, the following terms used in MarkLogic.
      1. Hosts
      2. Database
      3. Forest
      4. App Servers
      5. Modules
 

Hosts:-

A host is an instance of MarkLogic Server running on a single machine. Sometimes the machine on which the instance is installed is also referred to as the host. A host is always part of a group, which means a host cannot be created and configured on its own. By default, a host is added to the default group.
Now you must be wondering what a group is. For now, just note that every MarkLogic instance has a default group named "Default". I will cover groups and clusters as advanced MarkLogic topics in later posts; to begin with, we can work with the default group and cluster created with the MarkLogic instance.
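
To see the host and group for yourself, a short query such as the following can be run in Query Console. It is a minimal sketch that only reads the name of the current host and of the group that host belongs to.

```xquery
xquery version "1.0-ml";

(: Name of the host evaluating this query and the group it belongs to. :)
fn:concat(
  "Host ", xdmp:host-name(xdmp:host()),
  " is in group ", xdmp:group-name(xdmp:group())
)
```

On a fresh single-machine installation the group name is expected to be the default group described above.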

Database:-

In MarkLogic, a database is a layer that does not store content directly. It serves as a layer of abstraction between forests and servers (HTTP, XDBC, WebDAV) for accessing content saved in MarkLogic forests. A database consists of one or more forests configured on hosts, and the forests actually hold the data. The database provides a single point of access and a contiguous set of content for connecting to, querying or operating on data saved across multiple forests.

MarkLogic is installed with the following supporting databases by default.

      a) Documents – This is the default content database created at installation; it holds documents along with their properties.

      b) Last Login – This database tracks last-login information for the server.

      c) Schemas – This database holds schema information. By default, each database is connected to the Schemas database to store its schemas. Schemas can also be kept in the content database itself, but using the default Schemas database is recommended.

      d) Security – This database holds security-related configuration. Every database is connected to a security database, and keeping security information in the default Security database is recommended.

      e) Modules – This database is used to store executable XQuery code. It is created by default during installation and can hold our executable XQuery. XQuery can also be stored in another database, but that database must then be set as the modules database in the HTTP or XDBC server configuration.
If we use a modules database to hold XQuery files, each file's URI must be prefixed with the root (as configured on the HTTP or XDBC server) so that the server can find the file in the associated modules database.
For example, if you are using a modules database and specify a root of http://marklogic.com/ in an HTTP or XDBC server, the following documents are executable from that server:
http://marklogic.com/default.xqy
http://marklogic.com/myXQueryFiles/search_db.xqy

      f) Triggers – This database is used to store triggers. Triggers are executables that process content in response to document events. A default Triggers database is created during installation, but a separate database can also be used as the triggers database, just as with the Modules database.

Forests:-

Forests are the actual storage for content. A forest holds data in the form of XML, text or binary documents. Forests are created on hosts and attached to a database. One database can have multiple forests attached, but a forest can be attached to only one database at a time. Multiple forests attached to a database appear to queries as one contiguous set of content. A forest that is not attached to any database is of no use: no data can be loaded into or saved in it.
A forest holds its data in in-memory and on-disk structures called stands.
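
The relationship between a database and its forests can be checked with a small query. This sketch assumes the default Documents database exists and simply lists the names of the forests attached to it.

```xquery
xquery version "1.0-ml";

(: List the names of all forests attached to the Documents database. :)
for $forest-id in xdmp:database-forests(xdmp:database("Documents"))
return xdmp:forest-name($forest-id)
```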

App Servers:-

App Servers are created and managed at group level in MarkLogic Server. Each App Server is associated with one database and configured on a single port for communication. Applications use App Servers to access data saved in the forests of a MarkLogic database.

Applications communicate with these App Servers to fetch, insert, search or delete documents. The following App Servers are available in MarkLogic, each with its own purpose and limitations.

       a) HTTP Server
       b) XDBC Server
       c) WebDAV Server
       d) ODBC Server

HTTP Server: -

HTTP Servers in MarkLogic let you create XQuery-based web applications. Using an HTTP Server, a web application can execute XQuery against a database to fetch and process data in documents. An HTTP Server can return XHTML or XML content to a browser or to other HTTP-enabled client applications.
HTTP Servers are defined at group level and are accessible to all hosts in the group. An HTTP Server provides access to a set of XQuery programs saved in a specific directory structure. It is connected to a database on a specific port and executes those XQuery programs against the associated database.
An HTTP Server can execute XQuery code either from a specified location in the file system or from a modules database.
See the MarkLogic Administrator's Guide for the procedures to create and manage an HTTP Server.
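
As an illustration of what an HTTP Server executes, here is a minimal sketch of an XQuery page. The file name hello.xqy is hypothetical; the module is assumed to be saved under the server's root and requested from a browser on the server's port.

```xquery
xquery version "1.0-ml";

(: hello.xqy (hypothetical name): returns a small XHTML page. :)
xdmp:set-response-content-type("text/html"),
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Document count</title></head>
  <body>
    <!-- xdmp:estimate counts the documents in the associated database -->
    <p>Documents in this database: {xdmp:estimate(fn:doc())}</p>
  </body>
</html>
```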

XDBC Server: -

XDBC App Servers are used by XML Contentbase Connector (XCC) applications to communicate with MarkLogic Server. XCC is an API used by Java and .NET applications to communicate with MarkLogic Server. XDBC Servers are defined at group level and are accessible to all hosts in the group. An XDBC Server provides access to a specific forest and to a root containing a set of XQuery programs that reside within a specific directory structure.
XDBC Servers are used to insert, fetch and delete documents in MarkLogic from .NET or Java applications. They are also used to reach XQuery programs or libraries stored on MarkLogic Server.
An XDBC Server provides access to a specific database/forest, but through the XCC connector we can communicate with any database on a host within the cluster.
See the MarkLogic Administrator's Guide for the procedures to create and manage an XDBC Server.


WebDav Server: -

WebDAV Servers are used to access database documents and programs as if they were in a file system, using a WebDAV client. A WebDAV Server allows documents to be read, written and deleted directly in the database, subject to the configured security settings. A WebDAV Server is needed when we want to store and access our XQuery programs in a database under a specific directory in that database.
WebDAV Servers in MarkLogic are similar to HTTP Servers, but have the following important differences:
i) WebDAV servers cannot execute XQuery code.
ii) WebDAV servers support the WebDAV protocol to allow WebDAV clients to have read and write access (depending on the security configuration) to a database.
iii) A WebDAV server only accesses documents and directories in a database; it does not access the file system directly.

WebDAV (Web-based Distributed Authoring and Versioning) is a protocol that extends the HTTP protocol to provide the ability to write documents through these HTTP extensions. You need a WebDAV client to write documents, but you can still read them through HTTP (through a web browser, for example).
See the MarkLogic Administrator's Guide for information on creating and configuring a WebDAV Server.

ODBC Server: -

An ODBC Server allows SQL clients to communicate with MarkLogic Server and perform database operations using SQL statements. The ODBC Server is one of several components in MarkLogic that support SQL queries. Its basic purpose is to return relational-style data from MarkLogic in response to SQL queries. The ODBC Server returns data in tuple form and manages server state to support a subset of SQL and ODBC statements. ODBC Servers are created and managed at group level, and each ODBC Server is associated with a specific database.
See the MarkLogic Administrator's Guide for information on creating and configuring an ODBC Server.

Modules:-

Modules are sets of XQuery programs or executables, saved with the .xqy extension. They are simply XQuery statements that fetch or process data saved in MarkLogic. The database a program runs against is the one associated with the App Server that executes it, and where the program is stored depends on the Modules setting of that App Server. If the Modules setting is the file system, the XQuery programs are stored in the directory given as the App Server's root. If the Modules setting is a database (for example, the Modules database created by default for this purpose), we need a WebDAV Server for that database so that we can reach its directory structure and store our programs in a specific directory. The programs are then addressed by prefixing the .xqy file location with the App Server's root, which should be the top-level directory URL.

So friends, I think we have covered enough theory to start playing with MarkLogic Server. In the next post we will move on to the practical implementation of these concepts. Each may need a separate post, but you can also explore them on your own using the MarkLogic Administrator's Guide.
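
To find out where the current App Server looks for its modules, a query like the following sketch can be run through that App Server. It relies on xdmp:modules-database returning 0 when modules are read from the file system.

```xquery
xquery version "1.0-ml";

let $modules-db := xdmp:modules-database()
return
  if ($modules-db eq 0)
  then
    (: 0 means the App Server reads modules from the file system. :)
    fn:concat("Modules are read from the file system under ",
              xdmp:modules-root())
  else
    fn:concat("Modules are read from database ",
              xdmp:database-name($modules-db),
              " with root ", xdmp:modules-root())
```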
