
Original Translation | An IBM Senior Engineer on Data Lake Governance

2021/12/25


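This article is original content from 燈塔大數據. Individuals are welcome to share it on WeChat Moments; other organizations that repost it should add the following note at the top of the article:

"Reposted from: 燈塔大數據"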

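"In my view, a data lake is a reference architecture that provides an effective way to access data while preserving information governance and security."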

What is a data lake

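The data lake reference architecture describes the technical capabilities that an analytics system needs, without depending on any specific technology. This technology independence matters: many companies have already invested in data platforms and want to incorporate them into their solutions. In addition, technology keeps advancing, and the choice of technology usually depends on the volume, variety, and velocity of the data to be processed.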

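Whether an analytics system succeeds does not depend on technology alone. The data lake reference architecture also describes governance and management processes and definitions, so that the people and business systems around the technology can collaborate efficiently and provide a self-service, secure environment for using data.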

Data governance based on the data lake

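The importance of governance goes without saying. Like the fly-ball governor that James Watt fitted to the steam engine, a governance program keeps an "engine" in balance so that it works effectively. The "engine" may be a process, an organization, or a flow of information. In governance, the "engine" is the target being governed, and the key point is that this target is clearly defined.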

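Approaches to data lake governance vary widely because organizations define the engine being governed differently. For example, the IT department may see the data lake "engine" as a collection of technologies. The business may see the data lake as part of an innovation engine that helps it create new value from data. A good first step in defining a governance program for the data lake is to consider the needs of each group of data lake users, and then to consider which mechanisms would balance those needs.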

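For example, the owner of a system that supplies data to the data lake must maintain the catalog entries for the data coming from that system. In return, the owner receives analysis of the quality and consistency of that data, which helps them provide a better service to their users.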

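A data scientist may face restrictions when working with sensitive data, but in return gets a rich data catalog and an easier path to approval for the specific data sets they need. They can also contribute data and content to the catalog.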

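The more they contribute, the easier it becomes to find and obtain data. Balancing the needs of suppliers against the needs of consumers balances effort and value, creating a sustainable ecosystem.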

Who controls the data lake

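Besides designing the governance program from the users' perspective, we also need to decide who controls the data lake, because that choice affects how the data lake is governed. If the IT department is in control, normal IT governance can meet the data lake's governance requirements.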

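If the business is in control, the mechanisms that operate the data lake and the classifications that distinguish different types of data need to be abstracted through data services and metadata, creating a view of the data lake that the business can understand and operate. This view is then mapped to the actual data and technology through the metadata in the catalog, and the data lake services use the metadata settings to drive the behavior of the data lake.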
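The mapping from a business-facing view to the actual data can be pictured with a small sketch. Everything here is an assumption made for illustration: the `CatalogEntry` fields, the `Catalog` class, the classification labels, and the storage paths are hypothetical and do not come from the article or from any particular product.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CatalogEntry:
    """Catalog metadata for one data set (all field names are hypothetical)."""
    business_name: str      # the name shown in the business-facing view
    classification: str     # business-defined data type, e.g. "public" or "sensitive"
    physical_location: str  # where the data actually lives in the underlying technology


class Catalog:
    """Maps the business-facing view onto the actual data through metadata."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.business_name] = entry

    def reclassify(self, business_name, classification):
        # The business changes behavior by editing metadata, not the technology.
        old = self._entries[business_name]
        self._entries[business_name] = replace(old, classification=classification)

    def resolve(self, business_name, cleared_for):
        """Return the physical location if the caller is cleared for the classification."""
        entry = self._entries[business_name]
        if entry.classification not in cleared_for:
            raise PermissionError(
                f"{business_name!r} is classified as {entry.classification!r}"
            )
        return entry.physical_location


# Hypothetical entries; the paths are placeholders, not real locations.
catalog = Catalog()
catalog.register(CatalogEntry("Customer orders", "public", "warehouse/orders"))
catalog.register(CatalogEntry("Customer contacts", "sensitive", "warehouse/contacts"))
```

In this sketch a data lake service would call `resolve` before handing out data, so changing a classification in the catalog changes what the service allows without touching the storage layer.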

Once the "engine" has been defined, the governance program can be designed in the normal way:

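Set standards for the data lake's metadata, formats, and best practices;
Measure and monitor adherence to these standards;
Handle exceptions appropriately, answer compliance questions, and adjust the program based on feedback.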
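The three steps above can be sketched as a small check over catalog entries. This is a minimal illustration under stated assumptions: the required metadata fields, the allowed formats, and the dictionary layout of an entry are all hypothetical choices, not standards taken from the article.

```python
# Step 1: standards (hypothetical examples of what a program might require).
REQUIRED_METADATA = ("owner", "classification", "description")
ALLOWED_FORMATS = {"parquet", "csv", "json"}


def check_entry(entry):
    """Step 2, per entry: list every way the entry violates the standards."""
    violations = []
    for field in REQUIRED_METADATA:
        if not entry.get(field):
            violations.append(f"missing metadata: {field}")
    if entry.get("format") not in ALLOWED_FORMATS:
        violations.append(f"unsupported format: {entry.get('format')}")
    return violations


def adherence_report(entries):
    """Step 2, overall: the adherence rate plus the exceptions to act on in step 3."""
    exceptions = {}
    for entry in entries:
        violations = check_entry(entry)
        if violations:
            exceptions[entry["name"]] = violations
    compliant = len(entries) - len(exceptions)
    rate = compliant / len(entries) if entries else 1.0
    return rate, exceptions


# Hypothetical catalog entries used only to exercise the check.
entries = [
    {"name": "sales", "owner": "finance", "classification": "public",
     "description": "Daily sales totals", "format": "parquet"},
    {"name": "logs", "owner": "", "classification": "public",
     "description": "Web server logs", "format": "xml"},
]
rate, exceptions = adherence_report(entries)
```

The returned exceptions are the input to the third step: each one is either fixed, granted as a managed exception, or used as feedback to revise the standards themselves.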

Balance and value in governance

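Finally, I want to stress again the importance of feedback in achieving balance and value. A governance program must be dynamic, and it must demonstrate its own value. Feedback mechanisms should not be neglected: they prompt those running the program to adjust promptly as circumstances change.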

Original English text

Four perspectives on data lakes

"My view is that a data lake is a reference architecture that balances the desire for easy access to data with information governance and security."

The data lake reference architecture describes the technical capabilities necessary for a system of insight, while being independent of specific technologies. Being technology independent is important because most organizations already have investments in data platforms that they want to incorporate in their solution. In addition, technology is continually improving, and the choice of technology is often dictated by the volume, variety, and velocity of the data being managed.

A system of insight needs more than technology to succeed. The data lake reference architecture includes descriptions of governance and management processes and definitions to ensure that the human and business systems around the technology support a collaborative, self-service, and safe environment for data use.

Governance is a practice that you apply to "something." Just like James Watt's fly-ball governor for the steam engine, a governance program seeks to keep an engine in balance so it works effectively. This engine may be a process, organization, or flow of information. The important point is that the target of what you are governing is clearly defined.

Approaches to governance, particularly around a data lake, vary widely due to the different choices that organizations make in their definition of the engine being managed. For example, the IT department may see the data lake engine as a collection of technology working together. The business may see the data lake as part of an innovation engine helping them to create new value from data. So which is the right engine to govern? It depends on the objective for the data lake. A good starting point in defining the governance program for the data lake is to consider the perspective of each of the principal groups of users, define the engine that each group sees, and think about what mechanisms it would take to balance effort and value within each of these perspectives.

For example, the owner of a system that is supplying data to the data lake is required to maintain the catalog entry for the data coming from their system, and in return, they could get analysis on the quality or consistency of this data that helps them provide a better service to their users. A data scientist may be restricted in how they work with sensitive data, but in return they get a rich catalog of data to choose from and easy processes to get permission to use the data sets they need. They may also be given the ability to contribute data and content for the catalog. The more they contribute, the easier the discovery process becomes. By balancing the needs of the suppliers with the needs of the consumers, the balance of effort and value is achieved, creating a sustainable ecosystem.

In addition to designing the governance program to the perspective of the users, it is also necessary to decide who is in control of the data lake: whether it is IT or the business will affect how the data lake is governed. When IT is in control, normal IT governance can manage many aspects of the data lake. However, when the business is in control, the mechanisms that operate the data lake, and the classifications that identify the different types of data, need to be abstracted through services and metadata to create a view of the data lake that makes sense to the business and can be modified by them as needed. This view is then mapped to the actual data and technology through the metadata in the catalog, and the metadata settings are used by the data lake services to drive the behavior of the data lake.

Once the engine has been defined, the governance program is designed in the normal way:

Setting standards for the metadata, formats and best practices for the data lake.

Measuring and monitoring the adherence to these standards.

Taking action as appropriate, such as managing exceptions, answering compliance questions and modifying the program based on feedback.

I would like to end by emphasizing the importance of feedback in achieving balance and value. Governance programs must be dynamic and demonstrate the value that they deliver. The feedback mechanisms should not be forgotten, as they enable the governance program to stay relevant to the changing needs of the business, which in turn change the nature of the engines we need to govern.

Translation: 燈塔大數據

