Oracle Data Pump was introduced in 10gR1 as a replacement for the classic export/import utilities. In this post we'll go through the internals of Data Pump to see how it works.
Oracle added two new packages, DBMS_DATAPUMP and DBMS_METADATA, which form the foundation of Data Pump export and import operations. So whether you use the command-line clients or OEM, you are ultimately working with the DBMS_DATAPUMP and DBMS_METADATA packages.
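To make this concrete, here is a minimal PL/SQL sketch of driving an export directly through DBMS_DATAPUMP, bypassing the expdp client entirely. The directory object (DATA_PUMP_DIR), dump file name, and SCOTT schema are illustrative assumptions:

```sql
DECLARE
  h NUMBER;
BEGIN
  -- Open a schema-mode export job; OPEN returns a handle to the job
  h := DBMS_DATAPUMP.OPEN(operation => 'EXPORT', job_mode => 'SCHEMA');

  -- Add a dump file in an existing directory object (DATA_PUMP_DIR assumed)
  DBMS_DATAPUMP.ADD_FILE(handle    => h,
                         filename  => 'scott.dmp',
                         directory => 'DATA_PUMP_DIR');

  -- Restrict the job to the SCOTT schema
  DBMS_DATAPUMP.METADATA_FILTER(handle => h,
                                name   => 'SCHEMA_EXPR',
                                value  => 'IN (''SCOTT'')');

  -- Start the job, then detach; the job keeps running on the server
  DBMS_DATAPUMP.START_JOB(h);
  DBMS_DATAPUMP.DETACH(h);
END;
/
```

This is exactly what expdp does under the covers: it is just a client that translates command-line parameters into calls like these.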
The most important part of the architecture is the master table: a directory that records and maintains the current state and location of every object being exported or imported. The master table also maintains the status of all the sub-processes (worker processes) forked from the master control process.
The master table is created in the schema of the user running the Data Pump job at the start of execution. In an export job, the contents of the master table are written to the dump file as the last step; in an import job, this information is loaded into the master table as the first step, so that the table can be used to sequence the creation of all the objects. In both cases, the master table is dropped from the schema once the job completes successfully.
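While a job is running you can see both the job and its master table (which is just a regular table in the job owner's schema, named after the job). A quick check, assuming the job owner is ADMIN:

```sql
-- Active and stopped Data Pump jobs
SELECT owner_name, job_name, operation, job_mode, state
FROM   dba_datapump_jobs;

-- The master table carries the same name as the job
SELECT table_name
FROM   dba_tables
WHERE  owner = 'ADMIN'
AND    table_name LIKE 'SYS_EXPORT%';
```

If a job aborts without cleaning up, the orphaned master table is what keeps it listed in DBA_DATAPUMP_JOBS.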
Following is the master table structure created during a schema export job:
SQL> desc "ADMIN"."SYS_EXPORT_SCHEMA_01"
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 PROCESS_ORDER                                      NUMBER
 DUPLICATE                                          NUMBER
 DUMP_FILEID                                        NUMBER
 DUMP_POSITION                                      NUMBER
 DUMP_LENGTH                                        NUMBER
 DUMP_ORIG_LENGTH                                   NUMBER
 DUMP_ALLOCATION                                    NUMBER
 COMPLETED_ROWS                                     NUMBER
 ERROR_COUNT                                        NUMBER
 ELAPSED_TIME                                       NUMBER
 OBJECT_TYPE_PATH                                   VARCHAR2(200)
 OBJECT_PATH_SEQNO                                  NUMBER
 OBJECT_TYPE                                        VARCHAR2(30)
 IN_PROGRESS                                        CHAR(1)
 OBJECT_NAME                                        VARCHAR2(500)
 OBJECT_LONG_NAME                                   VARCHAR2(4000)
 OBJECT_SCHEMA                                      VARCHAR2(30)
 ORIGINAL_OBJECT_SCHEMA                             VARCHAR2(30)
 ORIGINAL_OBJECT_NAME                               VARCHAR2(4000)
 PARTITION_NAME                                     VARCHAR2(30)
 SUBPARTITION_NAME                                  VARCHAR2(30)
 DATAOBJ_NUM                                        NUMBER
 FLAGS                                              NUMBER
 PROPERTY                                           NUMBER
 TRIGFLAG                                           NUMBER
 CREATION_LEVEL                                     NUMBER
 COMPLETION_TIME                                    DATE
 OBJECT_TABLESPACE                                  VARCHAR2(30)
 SIZE_ESTIMATE                                      NUMBER
 OBJECT_ROW                                         NUMBER
 PROCESSING_STATE                                   CHAR(1)
 PROCESSING_STATUS                                  CHAR(1)
 BASE_PROCESS_ORDER                                 NUMBER
 BASE_OBJECT_TYPE                                   VARCHAR2(30)
 BASE_OBJECT_NAME                                   VARCHAR2(30)
 BASE_OBJECT_SCHEMA                                 VARCHAR2(30)
 ANCESTOR_PROCESS_ORDER                             NUMBER
 DOMAIN_PROCESS_ORDER                               NUMBER
 PARALLELIZATION                                    NUMBER
 UNLOAD_METHOD                                      NUMBER
 LOAD_METHOD                                        NUMBER
 GRANULES                                           NUMBER
 SCN                                                NUMBER
 GRANTOR                                            VARCHAR2(30)
 XML_CLOB                                           CLOB
 PARENT_PROCESS_ORDER                               NUMBER
 NAME                                               VARCHAR2(30)
 VALUE_T                                            VARCHAR2(4000)
 VALUE_N                                            NUMBER
 IS_DEFAULT                                         NUMBER
 FILE_TYPE                                          NUMBER
 USER_DIRECTORY                                     VARCHAR2(4000)
 USER_FILE_NAME                                     VARCHAR2(4000)
 FILE_NAME                                          VARCHAR2(4000)
 EXTEND_SIZE                                        NUMBER
 FILE_MAX_SIZE                                      NUMBER
 PROCESS_NAME                                       VARCHAR2(30)
 LAST_UPDATE                                        DATE
 WORK_ITEM                                          VARCHAR2(30)
 OBJECT_NUMBER                                      NUMBER
 COMPLETED_BYTES                                    NUMBER
 TOTAL_BYTES                                        NUMBER
 METADATA_IO                                        NUMBER
 DATA_IO                                            NUMBER
 CUMULATIVE_TIME                                    NUMBER
 PACKET_NUMBER                                      NUMBER
 INSTANCE_ID                                        NUMBER
 OLD_VALUE                                          VARCHAR2(4000)
 SEED                                               NUMBER
 LAST_FILE                                          NUMBER
 USER_NAME                                          VARCHAR2(30)
 OPERATION                                          VARCHAR2(30)
 JOB_MODE                                           VARCHAR2(30)
 QUEUE_TABNUM                                       NUMBER
 CONTROL_QUEUE                                      VARCHAR2(30)
 STATUS_QUEUE                                       VARCHAR2(30)
 REMOTE_LINK                                        VARCHAR2(4000)
 VERSION                                            NUMBER
 JOB_VERSION                                        VARCHAR2(30)
 DB_VERSION                                         VARCHAR2(30)
 TIMEZONE                                           VARCHAR2(64)
 STATE                                              VARCHAR2(30)
 PHASE                                              NUMBER
 GUID                                               RAW(16)
 START_TIME                                         DATE
 BLOCK_SIZE                                         NUMBER
 METADATA_BUFFER_SIZE                               NUMBER
 DATA_BUFFER_SIZE                                   NUMBER
 DEGREE                                             NUMBER
 PLATFORM                                           VARCHAR2(101)
 ABORT_STEP                                         NUMBER
 INSTANCE                                           VARCHAR2(60)
 CLUSTER_OK                                         NUMBER
 SERVICE_NAME                                       VARCHAR2(100)
 OBJECT_INT_OID                                     VARCHAR2(32)
SQL>
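Since the master table is an ordinary table, you can query it while the job runs to watch per-object progress. A sketch against the example job above (the ADMIN owner and job name are from the desc output; column meanings are inferred from their names):

```sql
SELECT object_schema, object_name, object_type,
       processing_state, processing_status, completed_rows
FROM   "ADMIN"."SYS_EXPORT_SCHEMA_01"
WHERE  object_name IS NOT NULL
ORDER  BY process_order;
```

Rows without an object name hold job-level bookkeeping (file definitions, parameters, queue names), which is why the filter is there.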
Client Process

expdp, impdp, the OEM interface, and any custom interface are the client processes provided by Oracle through which we can use the Data Pump APIs.
Shadow Process

The shadow process is created when a client logs in to the database, and it services the Data Pump API requests (for example, DBMS_DATAPUMP.OPEN). It is responsible for creating the job, which in turn creates the master table and the master control process.
Master Control Process
The master control process (MCP), created by the shadow process upon a START_JOB request from the Data Pump API, is the master process that controls the execution and sequencing of the Data Pump job using the information fetched from the master table. The MCP forks worker processes according to the value of the PARALLEL parameter and dispatches the data unloading and loading work to them. The MCP has a process name of the form <Instance Name>_DMnn_<pid>.
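For example, asking for four workers is just a matter of the PARALLEL parameter; the %U substitution variable lets each worker write to its own dump file (credentials, schema, and file names here are placeholders):

```shell
expdp admin/password SCHEMAS=scott DIRECTORY=DATA_PUMP_DIR \
      DUMPFILE=scott_%U.dmp PARALLEL=4 LOGFILE=scott_exp.log
```

Without enough dump files (or %U), workers can end up waiting on file access, so PARALLEL=n is usually paired with at least n files.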
Worker Process

The job of a worker process is to unload/load data and metadata from/to the database, and to update the master table. As objects are unloaded or loaded, their status (PENDING, COMPLETED, FAILED, etc.) along with other information (number of rows, size, object type, etc.) is updated in the master table. A worker process has a name of the form <Instance Name>_DWnn_<pid>.
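You can observe these processes from another session. A sketch, assuming the DMnn/DWnn naming shows up in V$PROCESS.PROGRAM on your platform:

```sql
-- Sessions attached to Data Pump jobs
SELECT d.owner_name, d.job_name, s.sid, s.serial#
FROM   dba_datapump_sessions d
JOIN   v$session s ON s.saddr = d.saddr;

-- MCP (DMnn) and worker (DWnn) background processes
SELECT pid, spid, program
FROM   v$process
WHERE  program LIKE '%DM0%' OR program LIKE '%DW0%';
```

This is handy when a job appears hung: you can map each worker back to its OS process and see what it is waiting on.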
Parallel Query process
For further parallel processing, especially when table partitioning is involved, a worker process can itself create parallel query (PQ) processes, assign the unloading/loading work to them, and act as the query coordinator.
Command and control queue
All the Data Pump processes described above subscribe to this queue. The command and control queue is used for interprocess communication among the Data Pump processes to coordinate API commands, loading/unloading requests and responses, file requests, logging, etc. This queue is named KUPC$C_<job-unique>_<timestamp>.
Status queue

Only the shadow process subscribes to read from this queue, which it uses to receive work-in-progress and error messages from the MCP, the only process that writes to it. This queue is named KUPC$S_<job-unique>_<timestamp>.
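Both queues are visible in the data dictionary while a job is running; a quick way to spot them:

```sql
SELECT owner, name, queue_table, queue_type
FROM   dba_queues
WHERE  name LIKE 'KUPC$%';
```

They are created with the job and dropped along with the master table when the job completes.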
That is all you need to know about the Data Pump architecture; now it's time to see the methods Oracle uses for data movement.
Data Pump has three options for moving data out of or into the database, and based on several performance-related factors and user-specified arguments, the MCP chooses the fastest option for each particular operation.
The available options are:-
- Direct path loads and unloads
- External tables
- Insert-as-select over a network link
Unless explicitly specified by the user, the MCP chooses the best option based on factors such as the data types used in the table definition, whether REMAP or NETWORK_LINK clauses are used, the total size of the data to be exported/imported, the PARALLEL clause, etc.
For example, if there is only one large table to be exported, the MCP will choose external tables with maximum parallelization so that the worker process can use parallel query slaves to move the data. Conversely, if the table were small, external tables would offer no benefit over a direct path unload.
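If you want to override the MCP's choice, the ACCESS_METHOD parameter forces a specific path (documented from 12c; treated as an internal/diagnostic parameter in earlier releases). The table and file names below are placeholders:

```shell
expdp admin/password TABLES=scott.big_table DIRECTORY=DATA_PUMP_DIR \
      DUMPFILE=big.dmp ACCESS_METHOD=EXTERNAL_TABLE
```

Valid values are AUTOMATIC (the default), DIRECT_PATH, and EXTERNAL_TABLE; if the forced method cannot handle the table, the job fails rather than falling back.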
That is enough for today (at least for me). In the next post on Data Pump internals, I'll show how these processes use the master table information to attach/detach, stop, restart, kill, and resume Data Pump jobs.
Until then, enjoy the internals and try some brainstorming with the strace of an expdp, which you can find here.