Notes on StreamSets
Introduction
While looking into ways to ingest data into Elasticsearch, I came across this post on Elastic's official site:
https://www.elastic.co/blog/elasticsearch-plus-streamsets-reliable-data-ingestion
It looked like a really nice product, so I wrote up this memo to record and introduce it.
Roughly speaking
I'm sure there are some glaring mistakes in here, but I decided to write up a rough summary anyway.
It's a tool that lets you wire up data inputs, transformations, and outputs through a GUI to model and collect data.
Watching the video makes it much easier to grasp.
With just these building blocks, it feels like you could handle almost anything.
The main players:

- origins – where data is read from
- destination – where data is written to
- processor – a stage that performs transformations
- pipeline – the thing that ties these together
Details can be found at http://streamsets.com/documentation/datacollector/1.1.3/help/.
At the time of writing, the following origins are available. Impressive.
- Amazon S3 – Reads files from Amazon S3.
- Directory – Reads fully-written files from a directory.
- File Tail – Reads lines of data from an active file after reading related archived files in the directory.
- HTTP Client – Reads data from a streaming HTTP resource URL.
- JDBC Consumer – Reads database data through a JDBC connection.
- JMS Consumer – Reads messages from JMS.
- Kafka Consumer – Reads messages from Kafka.
- Kinesis Consumer – Reads data from Kinesis.
- MongoDB – Reads documents from MongoDB.
- Omniture – Reads web usage reports from the Omniture reporting API.
- RPC – Reads data from an RPC destination in an RPC pipeline.
- UDP Source – Reads messages from one or more UDP ports.

In cluster pipelines, you can use the following origins:

- Hadoop FS – Reads data from the Hadoop Distributed File System (HDFS).
- Kafka Consumer – Reads messages from Kafka. Use the cluster version of the origin.
The destinations:
- Cassandra – Writes data to a Cassandra cluster.
- Elasticsearch – Writes data to an Elasticsearch cluster.
- Flume – Writes data to a Flume source.
- Hadoop FS – Writes data to the Hadoop Distributed File System (HDFS).
- HBase – Writes data to an HBase cluster.
- Hive Streaming – Writes data to Hive.
- JDBC Producer – Writes data to JDBC.
- Kafka Producer – Writes data to a Kafka cluster.
- Kinesis Producer – Writes data to a Kinesis cluster.
- RPC – Passes data to an RPC origin in an RPC pipeline.
- Solr – Writes data to a Solr node or cluster.
- To Error – Passes records to the pipeline for error handling.
- Trash – Removes records from the pipeline.
And the processors as well. Quite something.
- Expression Evaluator – Performs calculations and appends the results to the record.
- Field Converter – Converts the data type of a field.
- Field Hasher – Uses an algorithm to encode sensitive string data.
- Field Masker – Masks sensitive string data.
- Field Merger – Merges fields in complex lists or maps.
- Field Remover – Removes fields from a record.
- Field Renamer – Renames fields in a record.
- Field Splitter – Splits the string values in a field into different fields.
- Geo IP – Provides geographic location information based on an IP address.
- JavaScript Evaluator – Processes records based on custom JavaScript code.
- JSON Parser – Parses a JSON object embedded in a string field.
- Jython Evaluator – Processes records based on custom Jython code.
- Log Parser – Parses log data in a field based on the specified log format.
- Record Deduplicator – Removes duplicate records.
- Stream Selector – Routes data to different streams based on conditions.
- Value Replacer – Replaces null values or replaces values with nulls.
Environment
CentOS 6.5 -> CentOS 6.5
Installation and startup
Following the official installation steps, I installed it from the tarball. All you have to do is extract it and start it.
curl -O https://archives.streamsets.com/datacollector/1.1.4/tarball/streamsets-datacollector-1.1.4.tgz  # download the tarball
tar -xvzf streamsets-datacollector-1.1.4.tgz  # extract it
cd streamsets-datacollector-1.1.4
./bin/streamsets dc  # this starts Data Collector
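As a quick sanity check of my own (not part of the official steps): with the default configuration the web UI listens on port 18630, and the default login should be admin/admin.

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:18630  # expect 200 (or a redirect to the login page)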
After startup
Once you log in, you should see a screen like this, and you can (probably) figure things out intuitively from there.
Impressions
Things like collecting application logs or pulling RDBMS performance data every hour, which I had been laboriously configuring and collecting with fluentd, turn out to be so easy here that it's almost depressing.
- MongoDB -> Elasticsearch
- JDBC (MySQL/Postgres/SQL Server) -> Elasticsearch
- Elasticsearch -> Elasticsearch
I tried these combinations and they worked almost right away (although for SQL Server, Oracle, and the like you do need to install and configure the vendor-supplied JDBC driver).
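A quick way to confirm that documents are actually arriving is to hit Elasticsearch's count API while the pipeline runs. The index name app_metrics below is only a placeholder; use whatever you configured in the Elasticsearch destination.

curl -s 'http://localhost:9200/app_metrics/_count?pretty'  # number of documents written so far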
On the origin side you can configure an offset based on a primary key or ID column, which gives you incremental (append-only) loads; that is also great.
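For reference, here is a minimal sketch of how I understand this works in the JDBC Consumer: you designate an offset column (a primary key or auto-increment ID) and write a query that references the ${OFFSET} placeholder, so each run only fetches rows newer than the last recorded offset. The table and column names below are made up.

# "app_metrics" and "id" are hypothetical; Data Collector substitutes ${OFFSET}
# with the last value it read from the configured offset column.
OFFSET_QUERY='SELECT * FROM app_metrics WHERE id > ${OFFSET} ORDER BY id'
echo "$OFFSET_QUERY"  # this string goes into the origin's SQL Query field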
I'm looking forward to seeing the feature set grow, and it feels like a great fit for ad-hoc data collection for analysis and similar use cases.
It meets my needs
I'll keep using it and update this post as I go.