Tuesday, 3 March 2015

MarkLogic Content Pump (MLCP)

Hi Friends, 

We often need to export data from MarkLogic, or import data directly into a directory of a MarkLogic database. The mlcp utility is very useful in such scenarios, especially when the number of documents to import or export is large. So today we are going to discuss MLCP and look at how to import and export with it. Complete documentation is available at docs.marklogic.com, but I am going to keep this simple and to the point to give you a quick start.

MarkLogic Content Pump is an open-source, Java-based command-line tool (mlcp). mlcp provides the fastest way to import, export, and copy data to or from MarkLogic databases. It is designed for integration and automation in existing workflows and scripts.

The MLCP tool has two modes of operation:
Local: MLCP drives all its work on the host where it is invoked. Resources such as import data and export destination must be reachable from that host.
Distributed: MLCP distributes its workloads across the nodes in a Hadoop cluster. Resources such as import data and export destination must be reachable from the cluster, which usually means via HDFS.
For today, we are only going to discuss local mode.

The very first thing we need to know before using MLCP is its prerequisites. As mentioned above, MLCP is a Java-based command-line tool, so a Java Runtime Environment must be installed on the machine where we plan to run it.

1) The Java Runtime Environment is free and can be downloaded from http://filehippo.com/download_jre_64/57157/

2) Now download the MLCP binaries from http://developer.marklogic.com/products/mlcp and unzip them.

That’s it; MLCP does not need to be installed separately and can be run directly from the unzipped directory. We are going to look at the import and export functionality of MLCP, but before that, another important point to keep in mind is that you may need the following privileges when importing data into or exporting data from MarkLogic.

hadoop-user-write - This privilege is needed when you import data into MarkLogic. It should be granted to a role assigned to the user that we pass to the mlcp command during import.
hadoop-user-read - This privilege is needed when you read and export data from MarkLogic. It should be granted to a role assigned to the user that we pass to the mlcp command during export.

Now let's come directly to the commands used to import and export data with MLCP.

Before executing an import or export command, open a command prompt and navigate to the bin directory extracted from the downloaded MLCP binaries package.

Suppose your MLCP bin directory is at “C:\mlcp\bin”. Navigate to the “C:\mlcp\bin\” directory in the command prompt.

Now look at the following commands for export and import.



Export 


C:\mlcp\bin>mlcp.bat export -host 127.0.0.1 -port 9101 -username user -password **** -output_file_path "c:\mlcp\data" -mode local

Description - This export command exports all the data of the database attached to the XDBC server whose port is given in the “-port” argument.

-host :- the IP address of the host where MarkLogic is installed and from which data is exported.
-port :- the port number of the XDBC server attached to the database from which data needs to be exported.
-username :- the MarkLogic user; this user must have a role with the “hadoop-user-read” privilege.
-password :- the password for that user, used to log in to the MarkLogic server.
-output_file_path :- the directory where the exported files should go. Enclose the path in double quotes.
-mode :- the mode of the mlcp command (i.e. local or distributed).
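Since MLCP is meant to be used from scripts as well, here is a minimal Python sketch that assembles the same export command as an argument list. The function name build_export_command is my own, not part of mlcp, and the values are just the ones from the example above.

```python
def build_export_command(host, port, username, password, output_path, mode="local"):
    """Return the mlcp export command from this post as a list of arguments.

    build_export_command is a hypothetical helper, not part of mlcp.
    """
    return [
        "mlcp.bat", "export",
        "-host", host,
        "-port", str(port),
        "-username", username,
        "-password", password,
        "-output_file_path", output_path,
        "-mode", mode,
    ]


# Same values as the export example above (the password is a placeholder).
command = build_export_command("127.0.0.1", 9101, "user", "****", r"c:\mlcp\data")
print(" ".join(command))
```

A list like this could then be handed to something like subprocess.run from the MLCP bin directory; I have left that part out because it depends on your setup.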

Apart from this command, which simply exports all the data of the database, we may need to export only specific data. In such cases a few more arguments can help.
-directory_filter :- exports all the data of one or more specific directories of the database
Ex -

C:\mlcp\bin>mlcp.bat export -host 127.0.0.1 -port 9101 -username user -password **** -output_file_path "c:\mlcp\data" -mode local -directory_filter /datasources/entities/person/

This command exports only the data under the “/datasources/entities/person/” directory of the database.

There are other options as well, for exporting the data of a specific collection or data selected by a specific XPath.
-collection_filter :- exports only the data of the specified collection
-document_selector :- exports only the data selected by the specified XPath expression
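To make the three filter options a bit more concrete, here is a small Python sketch that appends one of them to the basic export command. The helper add_filter, the collection name "persons" and the XPath "/person" are made up for illustration; only the option names come from mlcp.

```python
# The three export filter options discussed in this post.
KNOWN_FILTERS = {"-directory_filter", "-collection_filter", "-document_selector"}


def add_filter(command, option, value):
    """Return a copy of an export command with one filter option appended.

    add_filter is a hypothetical helper, not part of mlcp.
    """
    if option not in KNOWN_FILTERS:
        raise ValueError("unknown filter option: " + option)
    return command + [option, value]


# Basic export command from the example above (the password is a placeholder).
base = [
    "mlcp.bat", "export",
    "-host", "127.0.0.1", "-port", "9101",
    "-username", "user", "-password", "****",
    "-output_file_path", r"c:\mlcp\data",
    "-mode", "local",
]

by_directory = add_filter(base, "-directory_filter", "/datasources/entities/person/")
by_collection = add_filter(base, "-collection_filter", "persons")  # hypothetical collection
by_xpath = add_filter(base, "-document_selector", "/person")       # hypothetical XPath

for command in (by_directory, by_collection, by_xpath):
    print(" ".join(command))
```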

You can go to docs.marklogic.com for more details on the other available options.
 

Import


C:\mlcp\bin>mlcp.bat import -host 127.0.0.1 -port 9101 -username admin -password **** -input_file_path "C:\mlcp\data_to_import"

Description - This import command imports all the data of the specified directory into the database attached to the XDBC server whose port is given in the “-port” argument.

-input_file_path :- the directory path to import data from. The path should be enclosed in double quotes.

Note:- Data is imported with the same directory structure as specified in “-input_file_path”. For example, with the command above, all the data is imported under the “c:\mlcp\data_to_import\” directory in the MarkLogic database.

There may be a requirement to import data under a specific URI, in which case we need to modify the URI that is generated automatically during import. “-output_uri_replace” can help us here.
-output_uri_replace :- replaces part of the URI that is generated automatically during import. The whole argument should be enclosed in double quotes and the replacement string inside it in single quotes.

-output_uri_prefix :- adds a prefix to the URI generated during import.

Ex - let's add these arguments to our previous import command as below and run it

C:\mlcp\bin>mlcp.bat import -host 127.0.0.1 -port 9101 -username admin -password **** -input_file_path "C:\mlcp\data_to_import" -output_uri_replace "C:/mlcp/data_to_import,''" -output_uri_prefix /datasources

This command imports data from "C:\mlcp\data_to_import" to “/datasources/{data}”: -output_uri_replace replaces “C:/mlcp/data_to_import” in the generated URI with an empty string, and “/datasources” is then added as a prefix, so the final location is “/datasources/{data}”.
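The URI rewriting described above can be sketched in a few lines of Python. This is only my model of the behaviour as explained in this post (replace first, then add the prefix), not mlcp's actual code; the function rewrite_uri and the file name person/1.xml are hypothetical.

```python
import re


def rewrite_uri(generated_uri, replace_spec, prefix=""):
    """Model of -output_uri_replace followed by -output_uri_prefix.

    replace_spec has the form "pattern,'replacement'", as passed on the
    command line. rewrite_uri is a hypothetical helper, not part of mlcp.
    """
    pattern, replacement = replace_spec.split(",", 1)
    replacement = replacement.strip().strip("'")  # drop the single quotes
    return prefix + re.sub(pattern, replacement, generated_uri)


# Assumption: the generated URI starts with the input path, as described above.
uri = rewrite_uri(
    "C:/mlcp/data_to_import/person/1.xml",
    "C:/mlcp/data_to_import,''",
    prefix="/datasources",
)
print(uri)  # prints /datasources/person/1.xml
```

If the URIs generated on your system start with a slash before the drive letter, the same pattern would leave a double slash after “/datasources”, so it is worth checking a few real URIs after a test import and adjusting the pattern if needed.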

I think this is enough for today and leaves you plenty of room to explore more with MLCP.

Before we end the discussion, I would like to sum up the minimum requirements for executing the import/export commands:
1) A JRE must be installed on the machine that executes the mlcp commands
2) An XDBC app server must be created and available in MarkLogic, pointing to the database to be used for the import/export
3) “hadoop-user-write” - For the import command, this privilege must be granted to a role of the user given in the “-username” argument
4) “hadoop-user-read” - For the export command, this privilege must be granted to a role of the user given in the “-username” argument
5) The -mode argument must be local, and the directories must be accessible from the local system

So keep exploring and enjoy.

References
https://docs.marklogic.com/7.0/guide/ingestion/content-pump#id_49096
