Zi 字媒體

原創譯文|你應該知道的18個大數據工具

2021/12/25

yidianzixun

轉載聲明

本文為燈塔大數據原創內容，歡迎個人轉載至朋友圈，其他機構轉載請在文章開頭標註：

在當今的數字革命浪潮中，大數據成為公司企業分析客戶行為和提供個性化定製服務的有力工具，大數據切切實實地幫助這些公司進行交叉銷售，提高客戶體驗，並帶來更多的利潤。

隨著大數據市場的穩步發展，越來越多的公司開始部署大數據驅動戰略。

Apache Hadoop是目前最成熟的大數據分析工具，但是市場上也不乏其他優秀的大數據工具。目前市場上有數千種工具能夠幫你節約時間和成本，帶你從全新的角度洞察你所在的行業。

以下介紹18種功能實用的大數據工具：

Avro：由Doug Cutting公司研發，可用於編碼Hadoop文件模式的數據序列化。

Cassandra：一種分散式的開源資料庫。可用於處理商品伺服器在提供高可用性服務時產生的大量分散式數據。這是一種非關係型資料庫（NoSQL）解決方案，最初由Facebook主導研發。

目前很多公司組織都在使用這一資料庫，如Netflix，Cisco，Twitter。

Drill：一種開源分散式系統，用於大規模數據集的交互分析。Drill與谷歌的Dremel系統類似，由Apache公司管理運行。

Elasticsearch：Apache Lucene開發的開源搜索引擎。Elasticsearch是基於Java的系統，可以實現高速搜索，支持你的數據搜索工作。

Flume：使用網路伺服器、應用伺服器和移動伺服器的數據來填充Hadoop的大數據應用框架，是數據源和Hadoop之間的一種連接紐帶。

HCatalog：是針對Apache Hadoop的集中元數據管理和分享服務。可以通過它集中查看Hadoop集群中的所有數據，並可以在不知道數據在集群中存儲位置的情況下，通過Pig和 Hive等多種工具處理所有數據元素。

Impala: 使用與Apache Hive相同的元數據，SQL語法（Hive SQL），ODBC驅動程序和用戶界面（HueBeeswax），直接幫助您對存儲在HDFS或HBase中的Apache Hadoop數據進行快速的互動式SQL查詢。

它為批量導向或實時查詢提供了一個方便操作的統一平台。

JSON：今天的許多非關係型資料庫（NoSQL）都以JSON（JavaScript對象符號）格式存儲數據，這些格式在Web開發人員中很受歡迎。

Kafka：這是種分散式「發布——訂閱」的消息傳送系統，它能夠提供一種解決方案，幫助處理所有數據流活動，並在消費者網站上處理這些數據。

這種類型的數據（包括頁面查看數據，搜索數據和其他用戶操作數據）是當前社交網路的關鍵組成部分。

MongoDB：是一個在開源概念指導下開發出來的面向文檔的非關係型資料庫（NoSQL）。它具有完整的索引支持，同時可以靈活地對任何屬性進行索引，並在不影響功能的情況下進行橫向擴容。

Neo4j：是一個圖形資料庫，與關係資料庫相比，性能提升高達1000多倍或更高。

Oozie：一種工作流程處理系統，可以讓用戶自定義不同語言編寫的一系列工作，如Map Reduce，Pig 和 Hive。它還可以實現不同工作項目之間的智能連接，Oozie還支持用戶指定依賴關係。

Pig：是由雅虎開發的基於Hadoop的一種語言，對於用戶來說，學習起來相對簡單，且Pig擅長處理非常深入且非常長的數據管道（data pipeline）。

Storm：是一種免費的進行實時分散式計算的開源系統。通過Storm，用戶可以非常輕鬆的在能夠進行實時處理操作的範圍內，對非結構化數據流進行可靠處理。

系統具有容錯特性，支持幾乎所有編程語言，當然最常用的語言還是Java。Storm最初是Apache家族的一個分支，現在已被Twitter收購。

Tableau：是一種主要關注商業智能的數據可視化工具。用戶無需編程，就可以利用Tableau創建地圖，條形圖，散點圖等可視化圖像。

他們最近發布了一個Web連接器，允許用戶直接連接資料庫或應用程序界面（API），從而使用戶能夠在進行可視化項目時獲取實時數據。

ZooKeeper：為大型分散式系統提供集中配置和開放代碼名稱註冊的服務。

每天大數據技術領域都會湧現出大量新的大數據相關工具，要想學會使用每個工具是非常困難且沒有意義的。挑選幾個你能夠熟練使用的工具，並不斷學習技術知識，才是最好的方式。

英文原文

18 Big Data Tools You Need To Know About

Use these tools to get ahead

In today』s digital transformation, big datahas given organizations an edge to analyze customer behavior &hyper-personalize every interaction which results into cross-sell, improvedcustomer experience, and obviously more revenue.

The market for Big Data has grown upsteadily as more and more enterprises have implemented a data-driven strategy.

While Apache Hadoop is the most well-established tool for analyzing big data,there are thousands of big data tools out there.

All of them promising to saveyou time, money, and help you uncover never-before-seen business insights.

I have selected few to get you going….

Avro: It was developed by Doug Cutting& used for data serialization for encoding the schema of Hadoop files.

Cassandra: is a distributed and Open Sourcedatabase. Designed to handle large amounts of distributed data across commodityservers while providing a highly available service.

It is a NoSQL solution thatwas initially developed by Facebook. It is used by many organizations likeNetflix, Cisco, Twitter.

Drill: An open source distributed systemfor performing interactive analysis on large-scale datasets. It is similar toGoogle』s Dremel, and is managed by Apache.

Elasticsearch: An open source search enginebuilt on Apache Lucene. It is developed on Java, can power extremely fastsearches that support your data discovery applications.

Flume: is a framework for populating Hadoopwith data from web servers, application servers and mobile devices. It is theplumbing between sources and Hadoop.

HCatalog: is a centralized metadatamanagement and sharing service for Apache Hadoop.

It allows for a unified viewof all data in Hadoop clusters and allows diverse tools, including Pig andHive, to process any data elements without needing to know physically where inthe cluster the data is stored.

Impala: provides fast, interactive SQLqueries directly on your Apache Hadoop data stored in HDFS or HBase using thesame metadata, SQL syntax (Hive SQL), ODBC driver and user interface (HueBeeswax) as Apache Hive.

This provides a familiar and unified platform forbatch-oriented or real-time queries.

JSON: Many of today』s NoSQL databases storedata in the JSON (JavaScript Object Notation) format that』s become popular withWeb developers

Kafka: is a distributed publish-subscribemessaging system that offers a solution capable of handling all data flowactivity and processing these data on a consumer website.

This type of data(page views, searches, and other user actions) are a key ingredient in thecurrent social web.

MongoDB: is a NoSQL database oriented todocuments, developed under the open source concept. This comes with full indexsupport and the flexibility to index any attribute and scale horizontallywithout affecting functionality.

翻譯：燈塔大數據

桃園 qq 地點貓咪桃園市 taoyuan xuan 根部尾巴有大桃園旅遊景點