
Common distributed file storage: introduction, selection comparison, and architecture design

2022-06-28 17:05:00 【Full stack programmer webmaster】


Hello, I'm Cuckoo.

Some time ago, while working on a storage integration project, I surveyed the file systems in use at my company. We currently run GlusterFS, FastDFS, and others. With large numbers of small files and high concurrency, their performance drops sharply and becomes a bottleneck, so we plan to build a distributed object storage platform. Below I review and compare the popular unstructured file storage products on the market.

Origins of distributed storage

In this era of data explosion, the volume of data generated keeps growing, from GB to TB, PB, and ZB. Extracting value from that data is the ultimate goal for enterprises. But before massive data can be mined, it has to be stored, for example data at the TB scale.

Any discussion of data storage has to start with how fast a disk can be read and written. In the early 1990s, an ordinary hard disk held about 1 GB and could be read at about 4.4 MB/s, so reading the whole disk took roughly four minutes. Today a typical hard disk holds about 1 TB, nearly a thousand times more, but the read speed is only about 100 MB/s, so reading the whole disk takes more than two and a half hours. If an analysis works on many terabytes of data, just reading the data off disk can take days, before any computation or analysis begins. So how should big data be stored, computed on, and analyzed?
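The arithmetic behind those figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 10^9 bytes), a constant sequential transfer rate, and no seek overhead, and the function name `full_read_seconds` is made up for illustration. Under those assumptions the results land in the same range as the numbers quoted above.

```python
def full_read_seconds(capacity_bytes: float, throughput_bytes_per_s: float) -> float:
    """Time to read an entire drive sequentially at a constant transfer rate."""
    return capacity_bytes / throughput_bytes_per_s


# Decimal units are an assumption; the article does not say which it uses.
MB = 10**6
GB = 10**9
TB = 10**12

# Figures from the text: ~1 GB at 4.4 MB/s (early 1990s), ~1 TB at 100 MB/s (today).
old_drive = full_read_seconds(1 * GB, 4.4 * MB)
new_drive = full_read_seconds(1 * TB, 100 * MB)

print(f"1990s drive: {old_drive / 60:.1f} minutes")
print(f"1 TB drive:  {new_drive / 3600:.1f} hours")

# Capacity grew far faster than throughput, so a full read takes much longer.
slowdown = new_drive / old_drive
print(f"full-read slowdown: {slowdown:.0f}x")
```

The point of the comparison is the ratio: capacity grew about a thousandfold while throughput grew only about twentyfold, which is why a single disk cannot keep up with TB-scale analysis.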

Common distributed file storage

Common distributed file systems

GFS, HDFS, Lustre, Ceph, GridFS, MogileFS, TFS, FastDFS, and so on. Each suits a different field. None of them is a system-level distributed file system; they are application-level distributed file storage services.

Distributed file storage selection comparison

Well known open source distributed file storage

1.GFS（Google File System）

Google developed this proprietary, Linux-based distributed file system to meet its own needs. Although Google has published some of the system's technical details, the software itself has not been released as open source.

2.HDFS

Hadoop implements a distributed file system, the Hadoop Distributed File System, abbreviated HDFS. Hadoop was created by Doug Cutting, the founder of Apache Lucene, a widely used text search library. It originated in Apache Nutch, an open source web search engine that was itself part of the Lucene project. Apache Hadoop is an open source implementation of the MapReduce model, which is an important cornerstone of Google's empire.

3.TFS

TFS (Taobao FileSystem) is a highly scalable, highly available, high-performance distributed file system for Internet services. It targets massive amounts of unstructured data, runs on clusters of ordinary Linux machines, and provides highly reliable, highly concurrent storage access to external clients. TFS gives Taobao storage for massive numbers of small files, typically no larger than 1 MB each; it meets Taobao's need for small-file storage and is widely used across Taobao's applications. It uses an HA architecture and smooth scaling to ensure the availability and scalability of the whole file system. Its flat data organization maps a file name directly to the file's physical address, which simplifies the file access path and gives TFS good read and write performance.
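To make the idea of a flat layout concrete, here is a minimal sketch of a file name that carries its own location, so no directory tree has to be walked. It is an illustration only and not TFS's actual name format: the encoding, the `T` prefix handling, and the names `encode_name`, `decode_name`, `block_locations`, and `locate` are all assumptions made for this example.

```python
import base64
import struct
from typing import Dict, List, Tuple


def encode_name(block_id: int, file_id: int) -> str:
    """Pack (block_id, file_id) into a fixed-length, URL-safe file name."""
    raw = struct.pack(">IQ", block_id, file_id)  # 4-byte block id + 8-byte file id
    return "T" + base64.urlsafe_b64encode(raw).decode("ascii")


def decode_name(name: str) -> Tuple[int, int]:
    """Recover (block_id, file_id) from a name produced by encode_name."""
    raw = base64.urlsafe_b64decode(name[1:])
    block_id, file_id = struct.unpack(">IQ", raw)
    return block_id, file_id


# Hypothetical lookup table: which storage servers hold each block.
block_locations: Dict[int, List[str]] = {
    1024: ["ds-01:3200", "ds-07:3200"],
}


def locate(name: str) -> Tuple[List[str], int]:
    """Resolve a file name to (servers holding the block, file id inside it)."""
    block_id, file_id = decode_name(name)
    return block_locations[block_id], file_id


name = encode_name(1024, 7)
servers, file_id = locate(name)
print(name, servers, file_id)
```

Because the name itself encodes the block, a lookup needs one small table of block locations rather than per-file metadata, which is what makes this style attractive for very large numbers of small files.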

Google's published papers are the origin of many distributed file systems: both HDFS and TFS were designed with reference to Google's GFS.

Typical architecture design of distributed file storage

I'll take Hadoop's HDFS as the example, since it is the most widely used open source distributed file storage.

The Hadoop Distributed File System (HDFS) is designed to run on commodity hardware. HDFS is highly fault-tolerant and suited to deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets. HDFS relaxes some POSIX requirements to enable streaming access to file system data.

Large data sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.

A simple consistency model

HDFS applications need a write-once-read-many access model for files. Once a file has been created, written, and closed, it does not need to change. This assumption simplifies data consistency and makes high-throughput data access possible. MapReduce applications and web crawlers fit this model well. There are plans to extend the model in the future to support appending writes to files.
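The write-once-read-many contract can be sketched as a tiny in-memory class. This is a toy model of the access rule only, not HDFS client code: the class name `WriteOnceFile` and its methods are invented for illustration, and it simplifies by refusing reads until the file is closed.

```python
class WriteOnceFile:
    """Toy file that can be written until closed, then only read."""

    def __init__(self) -> None:
        self._chunks = []
        self._closed = False

    def write(self, data: bytes) -> None:
        # Writes are only legal before close(); afterwards the content is frozen.
        if self._closed:
            raise PermissionError("file is closed; contents are immutable")
        self._chunks.append(data)

    def close(self) -> None:
        self._closed = True

    def read(self) -> bytes:
        # Simplification for this sketch: readers only see completed files.
        if not self._closed:
            raise RuntimeError("file is still being written")
        return b"".join(self._chunks)


f = WriteOnceFile()
f.write(b"crawl-")
f.write(b"result")
f.close()

first_read = f.read()
second_read = f.read()  # any number of readers see identical bytes

try:
    f.write(b"late update")
    rejected = False
except PermissionError:
    rejected = True
```

Since closed files never change, every replica of a file stays valid forever and readers need no locking, which is where the consistency and throughput benefits come from.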

Portability between heterogeneous hardware and software platforms

HDFS was designed with portability across platforms in mind. This makes it easier to adopt HDFS as a platform for large-scale data applications.

Namenode and Datanode

HDFS uses a master/slave architecture. An HDFS cluster consists of one Namenode and a number of Datanodes.

The Namenode is a central server that manages the file system namespace and regulates client access to files.

Datanodes usually run one per node in the cluster and manage the storage attached to the node they run on. HDFS exposes a file system namespace and lets users store data in files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of Datanodes.
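Splitting a file into blocks and spreading them over Datanodes can be sketched as follows. The 128 MB block size and replication factor of 3 match the defaults in recent Hadoop releases, but the round-robin placement is an assumption made purely for illustration (real HDFS placement is rack-aware), and the function names and Datanode names are hypothetical.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                 # assumed number of copies per block


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) for every block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((offset, length))
        offset += length
    return blocks


def place_blocks(num_blocks: int, datanodes: List[str],
                 replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct Datanodes, round-robin."""
    if replication > len(datanodes):
        raise ValueError("not enough Datanodes for the requested replication")
    placement = {}
    for block_index in range(num_blocks):
        placement[block_index] = [
            datanodes[(block_index + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


file_size = 300 * 1024 * 1024  # a 300 MB file
blocks = split_into_blocks(file_size)
placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"])

for index, (offset, length) in enumerate(blocks):
    print(index, offset, length, placement[index])
```

A 300 MB file becomes two full blocks plus one short block, and each block lives on three different machines, so losing any single Datanode loses no data.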

The Namenode executes namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to specific Datanodes. Datanodes serve read and write requests from file system clients, and they create, delete, and replicate blocks under the Namenode's direction.
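The division of labor described above can be sketched with two toy classes: a Namenode that holds only metadata and Datanodes that hold only bytes. This is an in-memory illustration under simplifying assumptions (no replication, no networking, round-robin block allocation); every class, method, and path name here is hypothetical and none of it is the real HDFS API.

```python
class DataNode:
    """Stores block contents; knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]

    def delete_block(self, block_id):
        self.blocks.pop(block_id, None)


class NameNode:
    """Stores the namespace and the block-to-Datanode mapping; holds no file data."""

    def __init__(self, datanodes):
        self.datanodes = {dn.name: dn for dn in datanodes}
        self.namespace = {}  # path -> ordered list of block ids
        self.block_map = {}  # block id -> Datanode name
        self._next_block_id = 0

    def allocate_block(self, path):
        block_id = self._next_block_id
        self._next_block_id += 1
        names = sorted(self.datanodes)
        target = names[block_id % len(names)]  # assumed round-robin policy
        self.namespace.setdefault(path, []).append(block_id)
        self.block_map[block_id] = target
        return block_id, self.datanodes[target]

    def rename(self, old_path, new_path):
        # Pure metadata operation: no block is moved or copied.
        self.namespace[new_path] = self.namespace.pop(old_path)

    def locate(self, path):
        return [(b, self.datanodes[self.block_map[b]]) for b in self.namespace[path]]

    def delete(self, path):
        # The Namenode directs Datanodes to drop the file's blocks.
        for block_id in self.namespace.pop(path):
            self.datanodes[self.block_map.pop(block_id)].delete_block(block_id)


def write_file(namenode, path, data, block_size):
    """Client write path: ask the Namenode where to put each block, then send bytes to that Datanode."""
    for offset in range(0, len(data), block_size):
        block_id, datanode = namenode.allocate_block(path)
        datanode.write_block(block_id, data[offset:offset + block_size])


def read_file(namenode, path):
    """Client read path: get block locations from the Namenode, fetch bytes from the Datanodes."""
    return b"".join(dn.read_block(block_id) for block_id, dn in namenode.locate(path))


nn = NameNode([DataNode("dn1"), DataNode("dn2")])
write_file(nn, "/logs/a.txt", b"hello distributed world", block_size=8)
nn.rename("/logs/a.txt", "/logs/b.txt")
restored = read_file(nn, "/logs/b.txt")
print(restored)
```

Note that file contents never pass through the Namenode in this sketch: it answers "where" questions, and the client exchanges data with Datanodes directly, which keeps the central server from becoming a data bottleneck.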

The Namenode and Datanodes are designed to run on commodity machines, which typically run a GNU/Linux operating system. HDFS is written in Java, so any machine that supports Java can run a Namenode or a Datanode, and the portability of Java means HDFS can be deployed on a wide range of machines. A typical deployment runs a single Namenode instance on one machine, with each of the other machines in the cluster running one Datanode instance. The architecture does not rule out running multiple Datanodes on one machine, but that is rare.

The future of distributed storage

As society moves from the industrial age to the information age, advances in information technology and increasingly intelligent everyday life are driving explosive growth in data, and data is becoming the most valuable resource in the world.

By physical storage form, data storage can be divided into centralized storage and distributed storage. Centralized storage is mainly traditional storage arrays (traditional storage); distributed storage (cloud storage) is mainly software-defined storage.

Traditional storage has always been known for high reliability, good stability, and rich features. At the same time, it suffers from poor horizontal scalability, high prices, and difficulty connecting data across systems, which easily creates data silos and drives up the cost of managing and maintaining the data center.

Distributed storage stores data across multiple independent devices on a network. It generally uses standard x86 servers connected over a network, runs storage software on top of them, and the system as a whole provides storage services.

All in all, distributed file storage improves storage utilization, scales elastically, lowers operating costs, and avoids wasted resources, which makes it better suited to the data-explosion era ahead.

Publisher: Full stack programmer webmaster. When reprinting, please credit the source: https://javaforall.cn/132769.html Link to the original text: https://javaforall.cn


Copyright notice
This article was written by [Full stack programmer webmaster]. Please include a link to the original when reprinting. Thank you.
https://yzsam.com/2022/179/202206281640521271.html