Notes on StreamSets

Introduction

While researching how to ingest data into Elasticsearch, I came across this post on Elastic's official site:

https://www.elastic.co/blog/elasticsearch-plus-streamsets-reliable-data-ingestion

I put together this memo to record and introduce a product that looks very promising.

Roughly speaking

There are probably some significant mistakes in here, but I decided to summarize it roughly anyway.

It is a tool that lets you wire up data inputs / transformations / outputs through a GUI to model and collect data. Watching the video makes it easier to grasp.

With just these pieces, it feels like you can pull off almost anything.

The main building blocks:

    • origins – where data comes in
    • destination – where data goes out
    • processor – performs transformations
    • pipeline – the thing that ties these together

Details can be found at http://streamsets.com/documentation/datacollector/1.1.3/help/.

(image: adaptable-flows-2k.png)

The origins available so far are listed below. Pretty impressive.

    • Amazon S3 – Reads files from Amazon S3.
    • Directory – Reads fully-written files from a directory.
    • File Tail – Reads lines of data from an active file after reading related archived files in the directory.
    • HTTP Client – Reads data from a streaming HTTP resource URL.
    • JDBC Consumer – Reads database data through a JDBC connection.
    • JMS Consumer – Reads messages from JMS.
    • Kafka Consumer – Reads messages from Kafka.
    • Kinesis Consumer – Reads data from Kinesis.
    • MongoDB – Reads documents from MongoDB.
    • Omniture – Reads web usage reports from the Omniture reporting API.
    • RPC – Reads data from an RPC destination in an RPC pipeline.
    • UDP Source – Reads messages from one or more UDP ports.

In cluster pipelines, you can use the following origins:

    • Hadoop FS – Reads data from the Hadoop Distributed File System (HDFS).
    • Kafka Consumer – Reads messages from Kafka. Use the cluster version of the origin.

And the destinations.

    • Cassandra – Writes data to a Cassandra cluster.
    • Elasticsearch – Writes data to an Elasticsearch cluster.
    • Flume – Writes data to a Flume source.
    • Hadoop FS – Writes data to the Hadoop Distributed File System (HDFS).
    • HBase – Writes data to an HBase cluster.
    • Hive Streaming – Writes data to Hive.
    • JDBC Producer – Writes data to JDBC.
    • Kafka Producer – Writes data to a Kafka cluster.
    • Kinesis Producer – Writes data to a Kinesis cluster.
    • RPC – Passes data to an RPC origin in an RPC pipeline.
    • Solr – Writes data to a Solr node or cluster.
    • To Error – Passes records to the pipeline for error handling.
    • Trash – Removes records from the pipeline.

And the processors, too. Wow.

    • Expression Evaluator – Performs calculations and appends the results to the record.
    • Field Converter – Converts the data type of a field.
    • Field Hasher – Uses an algorithm to encode sensitive string data.
    • Field Masker – Masks sensitive string data.
    • Field Merger – Merges fields in complex lists or maps.
    • Field Remover – Removes fields from a record.
    • Field Renamer – Renames fields in a record.
    • Field Splitter – Splits the string values in a field into different fields.
    • Geo IP – Provides geographic location information based on an IP address.
    • JavaScript Evaluator – Processes records based on custom JavaScript code.
    • JSON Parser – Parses a JSON object embedded in a string field.
    • Jython Evaluator – Processes records based on custom Jython code.
    • Log Parser – Parses log data in a field based on the specified log format.
    • Record Deduplicator – Removes duplicate records.
    • Stream Selector – Routes data to different streams based on conditions.
    • Value Replacer – Replaces null values or replaces values with nulls.

Environment

CentOS 6.5 -> CentOS 6.5

Installation and startup

I followed the official installation instructions and installed from the tarball. You just extract it and start it.

curl -O https://archives.streamsets.com/datacollector/1.1.4/tarball/streamsets-datacollector-1.1.4.tgz
tar -xvzf streamsets-datacollector-1.1.4.tgz
cd streamsets-datacollector-1.1.4
./bin/streamsets dc # this starts it
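
If it came up cleanly, the web UI should be reachable from a browser. As a quick sanity check (this assumes the default port of 18630 has not been changed in the configuration; the default login should be admin / admin unless you changed it):

# Should print 200 once the Data Collector web UI is up (default port assumed to be 18630)
curl -sL -o /dev/null -w "%{http_code}\n" http://localhost:18630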

After startup

After logging in, you should see a screen like this, and you can (probably) figure things out intuitively from there.

(image: streamsets-1-empty-canvas.png)

Impressions

Things like collecting application logs or pulling RDBMS performance data every hour, which I had put real effort into configuring and collecting with fluentd, turn out to be almost depressingly simple here.

    • MongoDB -> ElasticSearch
    • JDBC (MySQL / Postgres / SQLServer) -> ElasticSearch
    • ElasticSearch -> ElasticSearch

I tried these combinations and they worked almost immediately (for SQL Server, Oracle, and the like you do need to install and configure the vendor-provided JDBC driver).
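
For reference, here is a rough sketch of how the vendor driver can be put in place with the tarball install. The directory below is an assumption (stage libraries appear to live under streamsets-libs/) and the jar name is only an example, so check the documentation for your version.

# Assumed layout of the extracted streamsets-datacollector-1.1.4 directory;
# the MySQL driver jar name is just an example.
cp mysql-connector-java-5.1.38-bin.jar streamsets-libs/streamsets-datacollector-jdbc-lib/lib/
./bin/streamsets dc   # restart so the new driver is picked up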

On the origin side, you can configure an offset against a primary key or ID column, which makes incremental (append-only) ingestion possible, and that is also great.
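
As a rough sketch of what that looks like on the JDBC Consumer side (the table and column names here are hypothetical, and the exact property names may differ by version), you point the origin at an offset column, give it an initial offset, and reference the offset in the query:

Offset Column:  id
Initial Offset: 0
SQL Query:      SELECT * FROM perf_metrics WHERE id > ${OFFSET} ORDER BY id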

(image: RPCpipelines.png)

I am looking forward to seeing the feature set expand, and it feels very well suited to things like ad hoc data collection for analysis.

It meets my needs

I will keep using it and update this post as I go.
