CDAP is highly extensible and exposes plugins that allow users to extend its capabilities. This page lists all the plugins available in CDAP. Refer to the community page to learn how to contribute your own plugin.

  • ADLSBatchSink
    Sink
    ADLSBatchSink
    Azure Data Lake Store Batch Sink writes data to an Azure Data Lake Store directory in Avro, ORC, or text format.
  • ADLSDelete
    Action
    ADLSDelete
    Deletes a file or files within an ADLS file system.
  • AddField
    Transform
    AddField
    Adds a new field to each record. The field value can either be a new UUID, or it can be set directly through configuration. This transform is used when you want to add a unique id field to each record, or when you want to tag each record with some constant value. For example, you may want to add the logical start time as a field to each record.
  • Amazon S3 Client
    Action
    Amazon S3 Client
    The Amazon S3 Client Action is used to work with S3 buckets and objects before or after the execution of a pipeline.
  • Argument Setter
    Action
    Argument Setter
    Performs an HTTP request to fetch arguments to set in the pipeline.
  • AvroDynamicPartitionedDataset
    Sink
    AvroDynamicPartitionedDataset
  • AzureBlobStore
    Source
    AzureBlobStore
    Batch source to use Microsoft Azure Blob Storage as a source.
  • ADLS Batch Source
    Source
    ADLS Batch Source
    Azure Data Lake Store Batch Source reads data from Azure Data Lake Store files and converts it into StructuredRecord.
  • AzureDecompress
    Action
    AzureDecompress
    The Azure Decompress Action plugin decompresses gz files from a container on the Azure Blob Storage service into another container.
  • AzureDelete
    Action
    AzureDelete
    The Azure Delete Action plugin deletes a container on the Azure Blob Storage service.
  • AzureFaceExtractor
    Transform
    AzureFaceExtractor
  • BigQuery Multi Table
    Sink
    BigQuery Multi Table
    This sink writes to multiple BigQuery tables. BigQuery is Google's serverless, highly scalable, enterprise data warehouse. Data is first written to a temporary location on Google Cloud Storage, then loaded into BigQuery from there.
  • BigQuery
    Sink
    BigQuery
    This sink writes to a BigQuery table. BigQuery is Google's serverless, highly scalable, enterprise data warehouse. Data is first written to a temporary location on Google Cloud Storage, then loaded into BigQuery from there.
  • BigQuery
    Source
    BigQuery
    This source reads the entire contents of a BigQuery table. BigQuery is Google's serverless, highly scalable, enterprise data warehouse. Data from the BigQuery table is first exported to a temporary location on Google Cloud Storage, then read into the pipeline from there.
  • Bigtable
    Sink
    Bigtable
    This sink writes data to Google Cloud Bigtable. Cloud Bigtable is Google's NoSQL Big Data database service. It's the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail.
  • Bigtable
    Source
    Bigtable
    This source reads data from Google Cloud Bigtable. Cloud Bigtable is Google's NoSQL Big Data database service. It's the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail.
  • CSVFormatter
    Transform
    CSVFormatter
  • CSVParser
    Transform
    CSVParser
  • Cassandra
    Sink
    Cassandra
    Batch sink to use Apache Cassandra as a sink.
  • Cassandra
    Source
    Cassandra
    Batch source to use Apache Cassandra as a source.
  • CloneRecord
    Transform
    CloneRecord
    Emits a configured number of copies of every input record it receives. This transform does not change any record fields or types; it is an identity transform.
  • Compressor
    Transform
    Compressor
    Compresses configured fields. Multiple fields can be compressed using different compression algorithms. The plugin supports SNAPPY, ZIP, and GZIP compression.
  • Conditional
    Condition
    Conditional
    A control flow plugin that allows conditional execution within pipelines. Conditions are specified as expressions whose variables can include values from the pipeline's runtime arguments, tokens from plugins that run before the condition, and globals that provide information about the pipeline, such as the stage, pipeline, logical start time, and plugin.
  • Cube
    Sink
    Cube
    Batch sink that writes data to a Cube dataset.
  • Data Profiler
    Analytics
    Data Profiler
    Calculates statistics for each input field. For every field, a total count and null count will be calculated. For numeric fields, min, max, mean, stddev, zero count, positive count, and negative count will be calculated. For string fields, min length, max length, mean length, and empty count will be calculated. For boolean fields, true and false counts will be calculated. When calculating means, only non-null values are considered.
  • Database
    Action
    Database
    Action that runs a database command.
  • Database
    Sink
    Database
    Writes records to a database table. Each record will be written to a row in the table.
  • Database
    Source
    Database
    Reads from a database using a configurable SQL query. Outputs one record for each row returned by the query.
  • DatabaseQuery
    Action
    DatabaseQuery
    Runs a database query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
  • Google Cloud Datastore
    Sink
    Google Cloud Datastore
    This sink writes data to Google Cloud Datastore. Datastore is a NoSQL document database built for automatic scaling and high performance.
  • Datastore
    Source
    Datastore
    This source reads data from Google Cloud Datastore. Datastore is a NoSQL document database built for automatic scaling and high performance.
  • DateTransform
    Transform
    DateTransform
    This transform takes a date in either a unix timestamp or a string, and converts it to a formatted string. (Macro-enabled)
  • Db2
    Action
    Db2
    Action that runs a DB2 command.
  • IBM DB2
    Sink
    IBM DB2
    Writes records to a DB2 table. Each record will be written to a row in the table.
  • Db2
    Source
    Db2
    Reads from a DB2 database using a configurable SQL query. Outputs one record for each row returned by the query.
  • Db2
    Action
    Db2
    Runs a DB2 query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
  • Field Decoder
    Transform
    Field Decoder
  • Decompress
    Action
    Decompress
  • Field Decompressor
    Transform
    Field Decompressor
  • Decryptor
    Transform
    Decryptor
    Decrypts one or more fields in input records using a keystore that must be present on all nodes of the cluster.
  • Deduplicate
    Analytics
    Deduplicate
  • Distinct
    Analytics
    Distinct
    De-duplicates input records so that all output records are distinct. Can optionally take a list of fields, which will project out all other fields and perform a distinct on just those fields.
  • DynHBase
    Sink
    DynHBase
    This plugin supports writing records with dynamic schemas to a local or remote HBase table. In addition to dynamic-schema tables, it also supports writing regular structured records to tables.
  • CDAP Table with Dynamic Schema
    Sink
    CDAP Table with Dynamic Schema
    This plugin supports writing records with dynamic schemas to a CDAP Table dataset. In addition to dynamic-schema tables, it also supports writing regular structured records to tables.
  • ADLS Batch Sink
    Sink
    ADLS Batch Sink
  • DynamicMultiFileset
    Sink
    DynamicMultiFileset
    This plugin is normally used in conjunction with the MultiTableDatabase batch source to write records from multiple databases into multiple filesets in text format. Each fileset it writes to will contain a single 'ingesttime' partition, which will contain the logical start time of the pipeline run. The plugin expects that the filesets it needs to write to will be set as pipeline arguments, where the key is 'multisink.[fileset]' and the value is the fileset schema. Normally, you rely on the MultiTableDatabase source to set those pipeline arguments, but they can also be set manually or by an Action plugin in your pipeline. The sink expects each record to contain a special split field that is used to determine which records are written to each fileset. For example, suppose the split field is 'tablename'. A record whose 'tablename' field is set to 'activity' will be written to the 'activity' fileset.
  • Elasticsearch
    Sink
    Elasticsearch
    Takes the Structured Record from the input source and converts it to a JSON string, then indexes it in Elasticsearch using the index, type, and idField specified by the user. The Elasticsearch server should be running prior to creating the application.
  • Elasticsearch
    Source
    Elasticsearch
    Pulls documents from Elasticsearch according to the query specified by the user and converts each document to a Structured Record with the fields and schema specified by the user. The Elasticsearch server should be running prior to creating the application.
  • Email
    Action
    Email
    Sends an email at the end of a pipeline run.
  • Field Encoder
    Transform
    Field Encoder
  • Encryptor
    Transform
    Encryptor
    Encrypts one or more fields in input records using a Java keystore that must be present on all nodes of the cluster.
  • ErrorCollector
    Error Handler
    ErrorCollector
    The ErrorCollector plugin takes errors emitted from the previous stage and flattens them by adding the error message, code, and stage to the record and outputting the result.
  • Excel
    Source
    Excel
    The Excel plugin provides the ability to read data from one or more Excel files.
  • FTP
    Source
    FTP
    Batch source for an FTP or SFTP source. Prefix of the path ('ftp://...' or 'sftp://...') determines the source server type, either FTP or SFTP.
  • FTPCopy
    Action
    FTPCopy
    Copies files from an FTP server to the specified destination.
  • FTPPut
    Action
    FTPPut
    Copies files to an FTP server from the specified source.
  • Fail Pipeline
    Sink
    Fail Pipeline
    A batch sink used to fail the running pipeline as soon as it receives its first record.
  • FastFilter
    Transform
    FastFilter
    Filters out messages based on specified criteria.
  • File
    Sink
    File
    Writes to a filesystem in various formats.
  • File
    Source
    File
    This source is used whenever you need to read from a distributed file system. For example, you may want to read in log files from S3 every hour and then store the logs in a TimePartitionedFileSet.
  • File
    Source
    File
    File streaming source. Watches a directory and streams file contents of any new files added to the directory. Files must be atomically moved or renamed.
  • FileAppender
    Sink
    FileAppender
    Writes to a CDAP FileSet in text format. HDFS append must be enabled for this to work. One line is written for each record sent to the sink. All record fields are joined using a configurable separator. Each time a batch is written, the sink will examine all existing files in the output directory. If there are any files that are smaller in size than the size threshold, or more recent than the age threshold, new data will be appended to those files instead of written to new files.
  • FileContents
    Action
    FileContents
    This action plugin can be used to check if a file is empty or if the contents of a file match a given pattern.
  • FileDelete
    Action
    FileDelete
    Deletes a file or files.
  • FileMove
    Action
    FileMove
    Moves a file or files.
  • Google Cloud Storage
    Sink
    Google Cloud Storage
    This plugin writes records to one or more files in a directory on Google Cloud Storage. Files can be written in various formats such as csv, avro, parquet, and json.
  • GCSBucketCreate
    Action
    GCSBucketCreate
    This plugin creates objects in a Google Cloud Storage bucket. Cloud Storage allows world-wide storage and retrieval of any amount of data at any time.
  • GCSBucketDelete
    Action
    GCSBucketDelete
    This plugin deletes objects in a Google Cloud Storage bucket. Cloud Storage allows world-wide storage and retrieval of any amount of data at any time.
  • GCSCopy
    Action
    GCSCopy
    This plugin copies objects from one Google Cloud Storage bucket to another. A single object can be copied, or a directory of objects can be copied.
  • GCSFile
    Source
    GCSFile
    This plugin reads objects from a path in a Google Cloud Storage bucket.
  • GCSMove
    Action
    GCSMove
    This plugin moves objects from one Google Cloud Storage bucket to another. A single object can be moved, or a directory of objects can be moved.
  • GCSMultiFiles
    Sink
    GCSMultiFiles
    This plugin is normally used in conjunction with the MultiTableDatabase batch source to write records from multiple databases into multiple directories in various formats. The plugin expects that the directories it needs to write to will be set as pipeline arguments, where the key is 'multisink.[directory]' and the value is the schema of the data.
  • GooglePublisher
    Sink
    GooglePublisher
    This sink writes to a Google Cloud Pub/Sub topic. Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
  • GoogleSubscriber
    Source
    GoogleSubscriber
    This source reads from a Google Cloud Pub/Sub subscription in real time. Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
  • Group By
    Analytics
    Group By
  • HBase
    Sink
    HBase
    Writes records to a column family in an HBase table with one record field mapping to the rowkey, and all other record fields mapping to table column qualifiers. This sink differs from the Table sink in that it does not use CDAP datasets, but writes to HBase directly.
  • HBase
    Source
    HBase
    Batch source that reads from a column family in an HBase table. This source differs from the Table source in that it does not use a CDAP dataset, but reads directly from HBase.
  • HTTP
    Sink
    HTTP
    Sink plugin that sends messages from the pipeline to an external HTTP endpoint.
  • HTTPCallback
    Action
    HTTPCallback
    Performs an HTTP request at the end of a pipeline run.
  • HTTPPoller
    Source
    HTTPPoller
  • HTTPToHDFS
    Action
    HTTPToHDFS
    Action that fetches data from an external HTTP endpoint and creates a file in HDFS.
  • MD5/SHA Field Dataset
    Transform
    MD5/SHA Field Dataset
  • Hive Bulk Export
    Action
    Hive Bulk Export
  • Hive Bulk Import
    Action
    Hive Bulk Import
  • JSONFormatter
    Transform
    JSONFormatter
  • JSONParser
    Transform
    JSONParser
    Parses an input JSON event into a record. The input JSON event could be either a map of string fields to values or it could be a complex nested JSON structure. The plugin allows you to express JSON paths for extracting fields from complex nested input JSON.
  • JavaScript
    Transform
    JavaScript
    Executes user-provided JavaScript that transforms one record into zero or more records. Input records are converted into JSON objects which can be directly accessed in JavaScript. The transform expects to receive a JSON object as input, which it can process to emit zero or more records, or emit errors, using the provided emitter object.
  • Joiner
    Analytics
  • KVTable
    Sink
    KVTable
    Writes records to a KeyValueTable, using configurable fields from input records as the key and value.
  • KVTable
    Source
    KVTable
    Reads the entire contents of a KeyValueTable, outputting records with a 'key' field and a 'value' field. Both fields are of type bytes.
  • Kafka
    Alert Publisher
  • Kafka
    Sink
    Kafka
    Kafka sink that allows you to write events to Kafka in CSV or JSON format. The plugin pushes data to a Kafka topic and can be configured to partition events based on a configurable key, to operate in sync or async mode, and to apply different compression types to events. This plugin uses the Kafka 0.10.2 Java APIs.
  • Kafka
    Source
    Kafka
    Kafka batch source. Emits records read from Kafka based on the schema and format you specify; if no schema or format is specified, the message payload is emitted. The source remembers the offset it read in the last run and continues from that offset for the next run. The Kafka batch source supports providing additional Kafka properties for the Kafka consumer, reading from Kerberos-enabled Kafka, and limiting the number of records read. This plugin uses the Kafka 0.10.2 Java APIs.
  • Kafka
    Source
    Kafka
    Kafka streaming source. Emits a record with the schema specified by the user. If no schema is specified, it will emit a record with two fields: 'key' (nullable string) and 'message' (bytes). This plugin uses the Kafka 0.10.2 Java APIs.
  • KafkaAlerts
    Alert Publisher
    KafkaAlerts
    Kafka Alert Publisher that allows you to publish alerts to Kafka as JSON objects. The plugin internally uses the Kafka producer APIs to publish alerts, and it lets you specify the Kafka topic to publish to as well as additional Kafka producer properties. This plugin uses the Kafka 0.10.2 Java APIs.
  • KinesisSink
    Sink
    KinesisSink
    Kinesis sink that outputs to a specified Amazon Kinesis Stream.
  • KinesisSource
    Source
    KinesisSource
    Spark streaming source that reads from AWS Kinesis streams.
  • Kudu
    Sink
    Kudu
    CDAP plugin for ingesting data into Apache Kudu. The plugin can be configured for both batch and real-time pipelines.
  • Kudu
    Source
    Kudu
    CDAP plugin for reading data from an Apache Kudu table.
  • LoadToSnowflake
    Action
    LoadToSnowflake
  • LogParser
    Transform
    LogParser
    Parses logs from any input source for relevant information such as URI, IP, browser, device, HTTP status code, and timestamp.
  • MLPredictor
    Analytics
    MLPredictor
    Uses a model trained by the ModelTrainer plugin to add a prediction field to incoming records. The same features used to train the model must be present in each input record, but input records can also contain additional non-feature fields. If the trained model uses categorical features, and if the record being predicted contains new categories, that record will be dropped. For example, suppose categorical feature 'city' was used to train a model that predicts housing prices. If an incoming record has 'New York' as the city, but 'New York' was not in the training set, that record will be dropped.
  • MultiFieldAdder
    Transform
    MultiFieldAdder
    The Multi Field Adder transform allows you to add one or more fields to the output. Each field specified has a name and a value. Values are currently limited to type string.
  • MultiTableDatabase
    Source
    MultiTableDatabase
    Reads from multiple tables within a database using JDBC. Often used in conjunction with the DynamicMultiFileset sink to perform dumps from multiple tables to HDFS files in a single pipeline. The source will output a record for each row in the tables it reads, with each record containing an additional field that holds the name of the table the record came from. In addition, for each table that will be read, this plugin will set pipeline arguments where the key is 'multisink.[tablename]' and the value is the schema of the table. This is to make it work with the DynamicMultiFileset.
  • MySQL Execute
    Action
    MySQL Execute
    Action that runs a MySQL command.
  • MySQL
    Sink
    MySQL
    Writes records to a MySQL table. Each record will be written to a row in the table.
  • MySQL
    Source
    MySQL
    Reads from a MySQL instance using a configurable SQL query. Outputs one record for each row returned by the query.
  • Mysql
    Action
    Mysql
    Runs a MySQL query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
  • NGramTransform
    Analytics
    NGramTransform
    Transforms the input features into n-grams, where an n-gram is a sequence of n tokens (typically words) for some integer 'n'.
  • Netezza Execute
    Action
    Netezza Execute
    Action that runs a Netezza command.
  • Netezza
    Sink
    Netezza
    Writes records to a Netezza table. Each record will be written to a row in the table.
  • Netezza
    Source
    Netezza
    Reads from a Netezza database using a configurable SQL query. Outputs one record for each row returned by the query.
  • Netezza
    Action
    Netezza
    Runs a Netezza query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
  • Normalize
    Transform
    Normalize
    Normalize is a transform plugin that breaks one source row into multiple target rows. Attributes stored in the columns of a table or a file may need to be broken into multiple records: for example, one record per column attribute. In general, the plugin allows the conversion of columns to rows.
  • NullFieldSplitter
    Transform
    NullFieldSplitter
  • ORCDynamicPartitionedDataset
    Sink
    ORCDynamicPartitionedDataset
  • Oracle
    Action
    Oracle
    Action that runs an Oracle command.
  • Oracle
    Sink
    Oracle
    Writes records to an Oracle table. Each record will be written to a row in the table.
  • Oracle
    Source
    Oracle
    Reads from an Oracle table using a configurable SQL query. Outputs one record for each row returned by the query.
  • Oracle
    Action
    Oracle
    Runs an Oracle query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
  • OracleExport
    Action
    OracleExport
    A Hydrator Action plugin to efficiently export data from Oracle to HDFS or the local file system. The plugin uses Oracle's command line tools to export data. The data exported by this tool can then be used in Hydrator pipelines.
  • PDFExtractor
    Transform
    PDFExtractor
  • ParquetDynamicPartitionedDataset
    Sink
    ParquetDynamicPartitionedDataset
  • Postgres
    Action
    Postgres
    Action that runs a PostgreSQL command.
  • Postgres
    Sink
    Postgres
    Writes records to a PostgreSQL table. Each record will be written to a row in the table.
  • Postgres
    Source
    Postgres
    Reads from a PostgreSQL database using a configurable SQL query. Outputs one record for each row returned by the query.
  • Postgres
    Action
    Postgres
    Runs a PostgreSQL query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
  • Projection
    Transform
    Projection
    The Projection transform lets you drop, keep, rename, and cast fields to a different type. Fields are first dropped based on the drop or keep field, then cast, then renamed.
  • PySparkProgram
    Action
    PySparkProgram
    Executes user-provided Spark code in Python.
  • Python
    Transform
    Python
    Executes user-provided Python code that transforms one record into zero or more records. Each input record is converted into a dictionary that can be directly accessed in Python. The transform expects to receive a dictionary as input, which it can process to emit zero or more transformed dictionaries, or emit an error dictionary, using the provided emitter object.
  • RecordSplitter
    Transform
    RecordSplitter
    Given a field and a delimiter, emits one record for each split of the field.
  • RedshiftToS3
    Action
    RedshiftToS3
  • Repartitioner
    Analytics
    Repartitioner
    This plugin repartitions a Spark RDD.
  • RowDenormalizer
    Analytics
    RowDenormalizer
    Converts raw data into denormalized data based on a key column. The user can specify the list of fields that should be used in the denormalized record, with an option to use an alias for the output field name. For example, 'ADDRESS' in the input can be mapped to 'addr' in the output schema.
  • Run
    Transform
    Run
    Runs an executable binary that is installed and available on the local filesystem of the Hadoop nodes. The Run transform plugin reads structured records as input and returns output records to be processed further downstream in the pipeline.
  • S3
    Sink
    S3
    This sink is used whenever you need to write to Amazon S3 in various formats. For example, you might want to create daily snapshots of a database by reading the entire contents of a table and writing it to this sink, so that other programs can then analyze the contents of the specified file.
  • Amazon S3
    Source
    Amazon S3
    This source is used whenever you need to read from Amazon S3. For example, you may want to read in log files from S3 every hour and then store the logs in a TimePartitionedFileSet.
  • S3ToRedshift
    Action
    S3ToRedshift
    Action that loads data from an AWS S3 bucket into an AWS Redshift table.
  • SFTPCopy
    Action
    SFTPCopy
  • SFTPDelete
    Action
    SFTPDelete
  • SFTPPut
    Action
  • Remote Program Executor
    Action
    Remote Program Executor
    Establishes an SSH connection with a remote machine to execute commands on that machine.
  • Salesforce
    Sink
    Salesforce
    A batch sink that inserts sObjects into Salesforce. Examples of sObjects are opportunities, contacts, accounts, leads, any custom objects, etc.
  • Salesforce
    Source
    Salesforce
    This source reads sObjects from Salesforce. Examples of sObjects are opportunities, contacts, accounts, leads, any custom object, etc.
  • Salesforce
    Source
    Salesforce
    This source tracks updates in Salesforce sObjects. Examples of sObjects are opportunities, contacts, accounts, leads, any custom object, etc.
  • Salesforce Marketing
    Sink
    Salesforce Marketing
    This sink inserts records into a Salesforce Marketing Cloud Data Extension. The sink requires Server-to-Server integration with the Salesforce Marketing Cloud API. See https://developer.salesforce.com/docs/atlas.en-us.mc-app-development.meta/mc-app-development/api-integration.htm for more information about creating an API integration.
  • SalesforceMultiObjects
    Source
    SalesforceMultiObjects
    This source reads multiple sObjects from Salesforce. The data which should be read is specified using list of sObjects and incremental or range date filters. The source will output a record for each row in the SObjects it reads, with each record containing an additional field that holds the name of the SObject the record came from. In addition, for each SObject that will be read, this plugin will set pipeline arguments where the key is 'multisink.[SObjectName]' and the value is the schema of the SObject.
  • Sampling
    Analytics
    Sampling
    Samples a large dataset flowing through this plugin to pull random records. Supports two types of sampling: systematic sampling and reservoir sampling.
  • ScalaSparkCompute
    Analytics
    ScalaSparkCompute
    Executes user-provided Spark code in Scala that transforms RDD to RDD with full access to all Spark features.
  • ScalaSparkProgram
    Action
    ScalaSparkProgram
    Executes user-provided Spark code in Scala.
  • Spark
    Sink
    Spark
    Executes user-provided Spark code in Scala that operates on an input RDD or Dataframe with full access to all Spark features.
  • Avro Snapshot Dataset
    Sink
    Avro Snapshot Dataset
    A batch sink for a PartitionedFileSet that writes snapshots of data as a new partition. Data is written in Avro format. A corresponding SnapshotAvro source can be used to read only the most recently written snapshot.
  • Avro Snapshot Dataset
    Source
    Avro Snapshot Dataset
    A batch source that reads from a corresponding SnapshotAvro sink. The source will only read the most recent snapshot written to the sink.
  • Parquet Snapshot Dataset
    Sink
    Parquet Snapshot Dataset
    A batch sink for a PartitionedFileSet that writes snapshots of data as a new partition. Data is written in Parquet format. A corresponding SnapshotParquet source can be used to read only the most recently written snapshot.
  • Parquet Snapshot Dataset
    Source
    Parquet Snapshot Dataset
    A batch source that reads from a corresponding SnapshotParquet sink. The source will only read the most recent snapshot written to the sink.
  • SnapshotText
    Sink
    SnapshotText
    A batch sink for a PartitionedFileSet that writes snapshots of data as a new partition. Data is written in Text format.
  • Google Cloud Spanner
    Sink
    Google Cloud Spanner
    This sink writes to a Google Cloud Spanner table. Cloud Spanner is a fully managed, mission-critical, relational database service that offers transactional consistency at global scale, schemas, SQL (ANSI 2011 with extensions), and automatic, synchronous replication for high availability.
  • Spanner
    Source
    Spanner
    This source reads from a Google Cloud Spanner table. Cloud Spanner is a fully managed, mission-critical, relational database service that offers transactional consistency at global scale, schemas, SQL (ANSI 2011 with extensions), and automatic, synchronous replication for high availability.
  • Google Cloud Speech-to-Text
    Transform
    Google Cloud Speech-to-Text
    This plugin converts audio files to text by using Google Cloud Speech-to-Text.
  • SQL Server Execute
    Action
    SQL Server Execute
    Action that runs a SQL Server command.
  • SQL Server
    Sink
    SQL Server
    Writes records to a SQL Server table. Each record will be written to a row in the table.
  • SQL Server
    Source
    SQL Server
    Reads from a SQL Server database using a configurable SQL query. Outputs one record for each row returned by the query.
  • SqlServer
    Action
    SqlServer
    Runs a SQL Server query at the end of the pipeline run. Can be configured to run only on success, only on failure, or always at the end of the run.
  • StructuredRecordToGenericRecord
    Transform
    StructuredRecordToGenericRecord
    Transforms a StructuredRecord into an Avro GenericRecord.
  • Transactional Alert Publisher
    Alert Publisher
    Transactional Alert Publisher
    Publishes alerts to the CDAP Transactional Messaging System (TMS) as JSON objects. The plugin allows you to specify the topic and namespace to publish to, as well as a rate limit for the maximum number of alerts to publish per second.
  • Avro Time Partitioned Dataset
    Sink
    Avro Time Partitioned Dataset
  • TPFSAvro
    Source
    TPFSAvro
    Reads from a TimePartitionedFileSet whose data is in Avro format.
  • ORC Time Partitioned Dataset
    Sink
    ORC Time Partitioned Dataset
  • TPFSParquet
    Sink
    TPFSParquet
  • Parquet Time Partitioned Dataset
    Source
    Parquet Time Partitioned Dataset
    Reads from a TimePartitionedFileSet whose data is in Parquet format.
  • CDAP Table Dataset
    Sink
    CDAP Table Dataset
    Writes records to a CDAP Table with one record field mapping to the Table rowkey, and all other record fields mapping to Table columns.
  • Table
    Source
    Table
    Reads the entire contents of a CDAP Table. Outputs one record for each row in the Table. The Table must conform to a given schema.
  • TopN
    Analytics
    TopN
    Top-N returns the top "n" records from the input set, based on the criteria specified in the plugin configuration.
  • Trash
    Sink
    Trash
    Trash consumes all input records and discards them; no output is generated or stored anywhere.
  • Twitter
    Source
    Twitter
    Samples tweets in real-time through Spark streaming. Output records will have this schema:
  • UnionSplitter
    Transform
    UnionSplitter
    The union splitter is used to split data by a union schema, so that type-specific logic can be done downstream.
  • Validator
    Transform
    Validator
    Validates a record, writing to an error dataset if the record is invalid. Otherwise it passes the record on to the next stage.
  • ValueMapper
    Transform
    ValueMapper
    Value Mapper is a transform plugin that maps string values of a field in the input record to a mapping value using a mapping dataset.
  • VerticaBulkExportAction
    Action
    VerticaBulkExportAction
    Bulk exports data from a Vertica table into a file.
  • VerticaBulkImportAction
    Action
    VerticaBulkImportAction
    The Vertica Bulk Import Action plugin is executed after a successful MapReduce or Spark job. It reads all the files in a given directory and bulk imports their contents into a Vertica table.
  • Window
    Analytics
    Window
    The Window plugin is used to window a part of a streaming pipeline.
  • WindowsShareCopy
    Action
    WindowsShareCopy
    Copies a file or files on a Microsoft Windows share to an HDFS directory.
  • Wrangler
    Transform
    Wrangler
    This plugin applies data transformation directives on your data records. The directives are generated either through an interactive user interface or by manual entry into the plugin.
  • XMLMultiParser
    Transform
    XMLMultiParser
    The XML Multi Parser Transform uses XPath to extract fields from an XML document. It will generate records from the children of the element specified by the XPath. If there is some error parsing the document or building the record, the problematic input record will be dropped.
  • XMLParser
    Transform
    XMLParser
    The XML Parser Transform uses XPath to extract fields from a complex XML event. This plugin should generally be used in conjunction with the XML Reader Batch Source. The XML Reader will provide individual events to the XML Parser, which will be responsible for extracting fields from the events and mapping them to the output schema.
  • XMLReader
    Source
    XMLReader
    The XML Reader plugin is a source plugin that allows users to read XML files stored on HDFS.
  • XML to Json String
    Transform
    XML to Json String
    Accepts a field that contains a properly formatted XML string and outputs a properly formatted JSON string version of the data. This is meant to be used with the JavaScript transform for parsing complex XML documents into parts. Once the XML is a JSON string, you can convert it into a JavaScript object using JSON.parse, as in the sketch below.
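    For illustration, here is a minimal sketch of how that parsing might look inside the JavaScript transform. The field name 'body' and the transform(input, emitter, context) signature are assumptions made for this example, not details taken from this page.

        // Minimal sketch, not authoritative plugin code.
        // Assumes the JSON string produced by the XML to Json String stage
        // arrives in a field named 'body' (a hypothetical name), and that the
        // JavaScript transform provides a transform function and an emitter
        // object as described in its entry above.
        function transform(input, emitter, context) {
          // Convert the JSON string into a JavaScript object.
          var doc = JSON.parse(input.body);
          // Work with 'doc' here, then emit the (possibly modified) record.
          emitter.emit(input);
        }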