Interview with Ant: How Do These Technology Pioneers Do Low-Level Development Well? | Outstanding Technical Team Interview
2022-07-30 08:06:00 【Alipay Technology】
This is a cloud native infrastructure R&D team. They set out to build a set of cloud native architecture components so that enterprises can focus on business development without distraction. PaaS, Service Mesh, custom hardware, and other work that sits far from the business is what they do. This is Ant Group's Trusted Native team.
“The better we do our job, the more the upper-layer applications can scale, and the higher the ratio of application developers to infrastructure developers can be,” said Wang Xu, senior technical expert at Ant and co-founder of Kata Containers. What makes infrastructure development difficult, and how does it differ from business R&D? For this piece we interviewed the senior experts on the Trusted Native team who are responsible for middleware, containers, and trusted technology, and asked them to describe what they are working on and how they think about it.
Ant's middleware has evolved together with its business architecture.
Ant's architecture has gone through roughly the following stages: monolithic architecture in 2003-2005, service-oriented architecture in 2006-2009, unitized architecture in 2010-2013, financial cloud architecture in 2014-2017, and cloud native architecture from 2018 to the present.
Behind the evolution of the architecture is a growing and increasingly complex business. The middleware can be traced back to Alipay's architecture in 2009. At that time Ant had not yet been spun off from Taobao, so Alipay's first-generation architecture was heavily influenced by Taobao. In 2014 Ant was formally established and at the same time began preparing MYbank (网商银行), which aimed to be the first bank on the cloud. This meant that the scenarios Alipay's architecture had to serve extended from pure transactions to the broader financial sector.
SOFA is a financial-grade distributed middleware developed in-house by Ant Group. It contains the components needed to build a financial-grade cloud native architecture, such as Service Mesh and message middleware. After going through modularization, servitization, and unitization, the SOFA team gave the fourth generation of SOFA a new name: SOFAStack.
Song Shun, senior technical expert at Ant Group, leads the team responsible for R&D of the SOFA middleware. In Song Shun's view, financial-grade middleware needs to be secure, stable, reliable, and efficient. Specifically:
Services need high availability: at least active-active within the same city, going beyond cold standby. Ant achieves this through its registry mechanism, which lets services in different data centers in the same city reach one another.
Services need high scalability to cope with traffic growth. Ant chose unitization: traffic is split up, for example partitioned by user ID, so that each group of users is served and maintained by the machines of a specific region.
Data must be strongly reliable and consistent. To this end, Ant introduced technologies such as distributed transactions to ensure that a transaction spanning different data centers succeeds or fails as a whole.
Second-level monitoring, so that traffic surges can be responded to quickly.
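
The unitization idea in the list above, splitting traffic by user ID so that each user is always served by the same unit, can be sketched roughly as follows. This is a minimal illustration and not Ant's implementation: the unit names, the `unit_for_user` and `route` helpers, and the choice of a SHA-256 hash are all assumptions made for the example.

```python
import hashlib

# Hypothetical unit names; the article does not name Ant's real units.
UNITS = ["unit-a", "unit-b", "unit-c"]


def unit_for_user(user_id: str, units=UNITS) -> str:
    """Map a user ID to one unit with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(units)
    return units[index]


def route(requests):
    """Group (user_id, payload) requests by the unit that owns each user."""
    routed = {unit: [] for unit in UNITS}
    for user_id, payload in requests:
        routed[unit_for_user(user_id)].append((user_id, payload))
    return routed
```

Because the mapping depends only on the user ID, every request from the same user lands in the same unit, which is what lets each unit be scaled and maintained on its own. A production system would also need a way to move users between units, which this sketch leaves out.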
Today, SOFAStack has entered its fifth generation of development.
SOFAStack product layout
As envisioned, Ant's next-generation technical architecture will likely have Mesh at its core, so the Trusted Native team hopes to move all middleware capabilities into the mesh. SOFAStack is already actively embracing and investing in Mesh. For example, because a Service Mesh consists of a control plane and a data plane, Ant develops two open source projects, SOFAMesh and SOFAMosn, one for each.
SOFAMesh is the Service Mesh control plane. It was forked from Istio, with features added, optimized, and improved according to internal needs. However, Song Shun said that once these changes are contributed back to Istio, the project will return to the community and will no longer be maintained separately.
MOSN is a data plane written in Golang. According to the plan, MOSN is to carry all network middleware capabilities, which makes it a large project requiring participation from many departments over a long time span. However, Envoy, the data plane that Istio uses by default, is written in C++, and building on it would have required more engineering effort inside Ant, so the team ultimately decided to build MOSN in Golang.
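
The split between a control plane and a data plane described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `ControlPlane` and `DataPlaneProxy` classes, their methods, and the round-robin policy are invented for the example and are not the SOFAMesh, Istio, or MOSN APIs.

```python
class DataPlaneProxy:
    """A sidecar-like proxy that forwards requests using rules it was given."""

    def __init__(self, name):
        self.name = name
        self.routes = {}      # service name -> list of backend addresses
        self._counters = {}   # service name -> number of requests routed

    def apply_config(self, routes):
        # Copy the rules so later control-plane edits arrive only via a push.
        self.routes = {svc: list(backends) for svc, backends in routes.items()}
        self._counters = {svc: 0 for svc in routes}

    def pick_backend(self, service):
        """Choose a backend round-robin; the application never sees this logic."""
        backends = self.routes.get(service)
        if not backends:
            raise LookupError(f"no route for service {service!r}")
        index = self._counters[service] % len(backends)
        self._counters[service] += 1
        return backends[index]


class ControlPlane:
    """Holds the desired routing rules and pushes them to every proxy."""

    def __init__(self):
        self.routes = {}
        self.proxies = []

    def register(self, proxy):
        self.proxies.append(proxy)
        proxy.apply_config(self.routes)

    def set_route(self, service, backends):
        self.routes[service] = list(backends)
        for proxy in self.proxies:
            proxy.apply_config(self.routes)
```

The point of the split is that routing policy changes in one place (the control plane) while request handling stays in the proxies (the data plane), so business code does not have to change when the rules do.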
“The lower-level the technology and the more complex the things it faces, the harder everything we do becomes,” Song Shun said. “For example, upgrading the entire Mesh means modifying machines with hundreds of thousands of cores at Ant, and there is not much experience the team can draw on. When there is nothing to refer to, the only option is to solve the problem in a new way.” But Song Shun added that innovation takes time and the rollout process is relatively slow, especially in larger enterprises.
Inside Ant, whether a technical concept actually proves useful determines whether it gets adopted. This shows in Ant's cautious attitude toward introducing Serverless.
“Serverless is a big concept and a good one, but how to make it work within our system and generate real value is something that has to be explored. A new technology or a new concept does not get adopted directly just because it exists.”
In Song Shun's view, Service Mesh mainly solves the problem of decoupling infrastructure from business R&D, so it is easier to roll out. Serverless, by contrast, is more than a technology. “The team hopes Serverless can change the entire development process and R&D model and thereby improve R&D efficiency, but that will take a very long time.”
At present, after more than two years of refinement, Ant's Serverless product has completed its first phase of rollout and promotion. SOFAServerless has onboarded more than 700 Java and Node.js applications internally, covering essentially all of Ant's business lines, and has supported thousands of full production development iterations.
As the business becomes more diverse, it may be distributed across different clouds. Different businesses also have different compliance requirements, so they need to be isolated from one another while still being able to interconnect. How the architecture supports this situation is one of the important challenges Ant faces.
To this end, Song Shun said, the team is developing a product named “Clouds a” that aims to hide the details of the different clouds from the business, so that users only need to develop their applications without caring which cloud they run on.
In the cloud native stack, containers provide the basic runtime environment for software systems and have become the de facto standard for application delivery.
2014 was a high point for container technology, but because of poor isolation, security has hung over developers' heads like the sword of Damocles. According to a StackRox report earlier this year, 55% of respondents had delayed deploying K8s applications into production for security reasons.
In 2015, the startup Sonic Prodigy seized the container and cloud wave and open-sourced runV, a container engine based on virtualization technology. In December 2017, after a merger with Intel's Clear Containers, Kata Containers was born, combining the security advantages of virtual machines (VMs) with the speed and manageability of containers.
In 2019, Sonic Prodigy founders Zhao Peng and Wang Xu joined Ant, expanding its talent base in the infrastructure field. Kata has since gone through two major evolutions and is now in the intense phase of developing version 3.0. According to Wang Xu, Kata 3.0 mainly brings improvements in the following three areas:
More of the implementation in Rust. Kata previously used Golang; after the agent was reimplemented in Rust, its memory footprint dropped from 11 MB to about 1 MB. Kata hopes that all components can eventually be written in Rust, at which point the Kata runtime will be consolidated from the original two processes into a single Rust Kata Runtime, greatly reducing operational and memory overhead.
Image acceleration. Layered container storage keeps redundant data when files are deleted, and container startup is wasteful as well. Kata 3.0 therefore introduces Nydus, a container image acceleration project.
Support for confidential computing. In practice, many users do not trust their container provider and do not want the provider to be able to inspect the contents of their containers. Kata 3.0 therefore introduces confidential computing.
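
The redundancy mentioned in the image acceleration item above comes from how layered images record deletions: an upper layer only marks a file as removed, while the bytes stay in the lower layer. The sketch below models that with plain dictionaries. It is a simplified illustration, assuming each layer is a `dict` of path to bytes and using a hypothetical `WHITEOUT` marker; it does not reproduce the Nydus format or any Nydus API.

```python
WHITEOUT = object()  # marker meaning "this path was deleted in this layer"


def merged_view(layers):
    """Apply layers bottom to top and return the files a container would see."""
    view = {}
    for layer in layers:
        for path, content in layer.items():
            if content is WHITEOUT:
                view.pop(path, None)   # hide the file; lower-layer bytes remain stored
            else:
                view[path] = content
    return view


def stored_bytes(layers):
    """Bytes kept in the image: every layer is stored, even shadowed data."""
    return sum(
        len(content)
        for layer in layers
        for content in layer.values()
        if content is not WHITEOUT
    )


def visible_bytes(layers):
    """Bytes the running container can actually reach."""
    return sum(len(content) for content in merged_view(layers).values())
```

Comparing `stored_bytes` with `visible_bytes` for an image whose upper layer deletes a large file from the base layer shows the gap: the deleted file still has to be stored and pulled even though the container can never read it.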
“Some things are certain to be added in 3.0, while others are only possibilities and may have to wait until 3.1 to be fully realized,” Wang Xu said. Kata 3.0 is reportedly due for release at the end of the year.
In Wang Xu's view, a relatively neutral open source community is somewhat fastidious about correctness and does not compromise on quality in order to ship.
“As I recall, the first version was submitted at the beginning of August that year, and it was not merged until late October. The architecture committee asked the developers to do a lot of work back then, because it wanted to ensure the correctness of the core functionality. Shortcomings could be fixed in later versions, so as not to stretch the timeline and hold up other development. Even so, nobody let go on the quality of the core.”
Kata's technical roadmap is shaped jointly by the open source community, feedback from upstream communities, and other participants, and all development is carried out in the open in the community.
Both SOFAStack and Kata chose to go open source. In fact, more and more of Ant's projects are open-sourced once they are relatively mature.
Open source products find it easier to earn developers' trust. The power of the open source community stimulates innovation, and new scenarios can suggest directions beyond the original product design and enrich product features. In addition, a well-functioning community gives the product better collaboration and governance mechanisms, making it less likely to be swayed by a single scenario, a single user, or a single contributor.
“To a large extent, open source software is better standardized and has better processes than the internal products of many Internet companies, because the community behind an open source project places higher demands on the correctness of the whole product,” Wang Xu said.
In Song Shun's view, for a technology to become a product it needs to be integrated with the open source ecosystem. Song Shun's team has been pushing to turn technology into products and products into commercial offerings; otherwise the work cannot join the ecosystem or enjoy its dividends. “By communicating with developers and working with other communities through open source, we can bring our product system into the open source ecosystem and make sure we do not stray off course,” Song Shun said.
Another consideration is commercial. Companies that use a product always want their own engineers to get up to speed easily, and open source products help newcomers to the industry, and even students, learn the technology better, which in turn builds a talent pool. Open-sourcing software also raises the bar for security. The issues of greatest concern today are software security and supply chain security.
In Wang Xu's view, many vulnerabilities in open source code are actually caused by imperfect processes. A well-functioning open source community therefore needs a sound security response mechanism or team to handle security incidents.
“Even high-quality software can hardly avoid vulnerabilities. What matters is how a vulnerability is handled once it appears and whether there is a process for it. That is a fundamental difference between projects. Whenever we discuss open source software now, we make a point of raising security. It cannot be ignored and has to be dealt with seriously.”
In fact, the security of open source software has to be handled collaboratively by the community. After discovering a problem, developers can contact the upstream vendor or the open source community to trigger the corresponding security response process.
A more complicated problem is open source supply chain security.
Open source or closed source, large companies have to consider supply chain security. The difference is that the copyright holder of much open source software is not a company or legal entity but an individual author, and redistributing the software can come with many restrictions. Enterprises therefore often find that the copyright of some part of an open source project is not their own, so they cannot simply close it or change its license; and if an open source project they use changes its license, later versions may no longer be usable.
Enterprise developers therefore want access to the source code so that they do not depend too heavily on a vendor. Wang Xu believes that open source supply chain security rests on the community having strong governance over the project, the project being independently controllable, and the copyright being well maintained.
Although more and more enterprises are choosing the cloud, concerns about data being seen or lost lead them to keep important or sensitive applications running on premises. In addition, multiple data owners (usually competitors) increasingly want to pool their data in the cloud for purposes such as AI training, but do not want their data to be taken by any other party. Privacy-enhancing technologies are therefore receiving more and more attention.
Among the many privacy-enhancing technologies, Ant chose the trusted execution environment (TEE) on the infrastructure side: it is both general-purpose and efficient, and it can be used on its own or combined with other technologies.
A TEE provides a hardware-backed “black box” whose authenticity and integrity can be proven remotely. Combining hardware and software lets upper-layer applications circulate data securely and with accelerated performance, and also allows upper-layer application software and business logic to be combined with hardware to form integrated hardware-software products.
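
The notion of a black box whose authenticity and integrity can be proven remotely can be made concrete with a toy attestation flow: the environment reports a hash (measurement) of the code it runs, bound to a fresh nonce and authenticated with a key, and the verifier compares it with the measurement it expects. The sketch below is illustrative only. The function names are hypothetical, and the shared-secret HMAC is a stand-in chosen for brevity; real TEEs sign quotes with asymmetric keys rooted in hardware, so the verifier never holds the signing key.

```python
import hashlib
import hmac


def measure(code: bytes) -> bytes:
    """Measurement of the loaded code: here simply its SHA-256 digest."""
    return hashlib.sha256(code).digest()


def make_quote(hardware_key: bytes, code: bytes, nonce: bytes) -> dict:
    """Produce an attestation 'quote' binding the measurement to a fresh nonce.

    hardware_key is an assumed stand-in for a hardware-protected key.
    """
    measurement = measure(code)
    mac = hmac.new(hardware_key, measurement + nonce, hashlib.sha256).digest()
    return {"measurement": measurement, "nonce": nonce, "mac": mac}


def verify_quote(hardware_key: bytes, quote: dict,
                 expected_measurement: bytes, nonce: bytes) -> bool:
    """Accept only an authentic, fresh quote for the expected code."""
    expected_mac = hmac.new(
        hardware_key, quote["measurement"] + quote["nonce"], hashlib.sha256
    ).digest()
    return (
        hmac.compare_digest(quote["mac"], expected_mac)
        and hmac.compare_digest(quote["nonce"], nonce)
        and hmac.compare_digest(quote["measurement"], expected_measurement)
    )
```

The nonce prevents an old quote from being replayed, and the measurement check is what lets a remote party refuse to send data to an environment running code it did not expect.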
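However, because TEEs lack an “operating system” like Linux, popular AI training frameworks, data processing frameworks, and the like are hard to run inside the “black box”. As a result, trusted execution environments have been difficult to use in production. To use a TEE, companies need to do a lot of adaptation work, and many applications cannot be ported even with modification, which has kept TEEs from being used at scale.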
To build better privacy protection capabilities, Ant needed to overcome this problem. In 2016, Yan Shoumeng began thinking about how to improve the development efficiency of TEE software. In 2019, after joining Ant, he founded the confidential computing team and launched the Occlum TEE OS project, which is dedicated to substantially improving the usability and security of TEEs. In 2020, the team published the paper “Occlum: Secure and Efficient Multitasking Inside a Single Enclave of Intel SGX”, which was widely recognized by industry and academia in the confidential computing field.
“Initially we did not put in many resources. Only after some results gradually emerged did people feel it was worth investing further.” This paper was one of those important results.
Occlum is written in Rust, a language with strong safety guarantees, supports multiple file systems, and offers a Linux-like development experience and a Docker-like user experience. Ant open-sourced Occlum in 2020, and in the same year it became the first open source project in the Confidential Computing Consortium (CCC) initiated by a Chinese company. It has been deployed at scale inside Ant and is used by Microsoft Azure, Alibaba Cloud, many open source projects, and privacy computing companies.
“For Occlum, the period before the paper can be called the exploratory stage. Publication meant the exploration had produced a preliminary result. After that we began the transition to integrate with the business, and today it is very well integrated with the cloud native development process,” Yan Shoumeng said.
At the time, the complex and diverse Enclave hardware platforms imposed a heavy learning and usage burden. Ant hoped for a unified Enclave abstraction and for remote attestation that was more independent and controllable. This was the origin of HyperEnclave, an independent TEE platform that combines virtualization technology with TPM technology.
However, the HyperEnclave project carried more risk than Occlum. “I had many years of experience with the SGX technology that Occlum is based on from my time at Intel. But this project was new, and virtualization and TPM were not the areas I knew best. The key problem is that China's talent pool for low-level technology is thin; there are very few such people on the market and they are hard to hire,” Yan Shoumeng recalled. “Fortunately, some people are interested in low-level work, such as those who aspire to build an operating system. We recruit well-trained engineers who care about this kind of work, and then they learn by doing.”
Basic technology requires large investment but takes a long time to pay off, and some people cannot wait and leave. “Over the past three years there have been many technical and non-technical challenges and setbacks. Many people did not stick it out and chose to leave the project or even the company. For most of its life the HyperEnclave project had only 3 people, and since the beginning about 4 people have left,” Yan Shoumeng sighed.
In the second half of 2019, Yan Shoumeng judged that independent controllability would become more and more important. Indeed, by around 2020 companies were stepping up investment in independent controllability, and the value of HyperEnclave suddenly became much more apparent. It was this anticipation that kept HyperEnclave's main researchers going. They hosted HyperEnclave's root of trust with national information security infrastructure such as the China Financial Certification Authority (CFCA), further strengthening its authority and autonomy.
Today HyperEnclave, an independent general-purpose TEE system that supports domestic CPUs and is compatible with the SGX ecosystem, has been deployed in key scenarios inside Ant and at external customers. An academic paper distilling three years of technical results was recently accepted by USENIX ATC'22, a top conference in the systems field. “Today, the colleagues who stuck with it have become top experts in virtualization, trusted computing, and confidential computing. They endured the loneliness of working alone in no man's land, resisted the temptation of short-term gains, and finally saw the clouds part and the moon shine through!” Yan Shoumeng said with feeling.
Occlum and HyperEnclave target a single compute node, but the cloud native world has many large-scale clusters based on Kubernetes, an environment that was not suited to confidential computing either. Yan Shoumeng's team therefore started the KubeTEE project, which brings Kubernetes and TEE together. KubeTEE offers a cluster-scale remote attestation service so that users do not need to care about TEE details, and its cluster key distribution and synchronization service lets TEEs support distributed computing as well.
Occlum makes a single-node TEE easier to use, KubeTEE extends from a single node to multiple nodes and clusters, and HyperEnclave provides an independent general-purpose TEE platform. Together the three projects form Ant's open source SOFAEnclave technology stack, addressing three current problems in the practical application of confidential computing.
Ant has now combined upper-layer application software and business logic to form a series of integrated hardware-software products. “Deeply integrated hardware-software products go through system-level performance tuning and compatibility and adaptation testing to form a complete solution that can be deployed directly. What we deliver to customers is essentially a box that works as soon as it is powered on, a complete product that requires no additional deployment, tuning, or compatibility testing. This makes deployment and operations very easy for customers and lets them focus more on the business itself,” said Kong Jincan, senior technical expert at Ant Group.
“We very much hope to create infrastructure that is trusted and secure by default. But the foundation of a trusted security system has to be hardware. So a few years ago we began developing hardware-software products for data centers, edge computing, and IoT, including chips, boards, servers, all-in-one machines, and many other products, to lay a solid foundation for Trusted Native.”
Kong Jincan said that within the Trusted Native technology system, combining software and hardware mainly addresses three problems.
The first is security hardening. For example, the Blade root of trust and cryptographic chip that we built package trusted boot, trusted measurement, and key protection into hardware, and have obtained the relevant review qualifications.
The second is performance acceleration. For example, our cryptographic cards and privacy computing accelerator cards offload frequently used cryptographic algorithms to hardware, which effectively improves application performance and thereby reduces the cost of adopting the Trusted Native technology system at scale. By combining software and hardware and using FPGA and ASIC chips, we accelerate time-consuming computations such as national cryptographic algorithms, homomorphic encryption, and private set intersection.
The third is forming well-integrated hardware-software products that help the technology land in customers' businesses. By combining software and hardware we developed a Trusted Native hardware-software base built from the team's self-developed hardware and software products, so that Trusted Native can be delivered as a whole. On this basis, together with Ant's business applications, we have successively launched the Morse privacy computing all-in-one machine, a blockchain all-in-one machine, the OceanBase database all-in-one machine, and others, all of which have received good market feedback.
Ant has formed a complete trusted hardware-software base that includes the underlying trusted software stack and trusted hardware components. The details are as follows:
SOFABoot, SOFARPC, SOFARegistry, Seata, and others are used to build stable, reliable, and efficient distributed systems; MOSN, Realtor, and others target cloud native scenarios and are used to decouple business applications from infrastructure; and Occlum, HyperEnclave, Kata, and others focus more on security, providing a safer environment for cloud native scenarios. The Trusted Native team is responsible for many projects. Each has its own division of labor, but together they form an organic whole. The team's designs always stay open, ensuring that each component can be integrated with or replaced by other open source components.
The Trusted Native team's culture can be summed up in three phrases: openness, ownership, and high standards. This is determined by the nature of infrastructure technology R&D.
First, infrastructure development needs to get to the essence of the technology, and many technologies develop better over the long term only after extensive discussion. Second, low-level components support the whole business stack. A single wrong line of code can have very serious consequences, and when something does go wrong, the team has to be able to step up, carry the pressure, and solve it. Infrastructure developers therefore need both the courage and the ability to take on challenges. Finally, infrastructure serves not only the company itself but also helps advance the whole industry, which is the motivation behind the pursuit of technical excellence.
“Optimizing each application individually is expensive and consumes a lot of manpower and resources. When the infrastructure team abstracts out the core, only a small amount of software needs to be maintained and optimized, and in most cases the result is better than having every application team do its own optimization,” Wang Xu said.
Enterprise development cannot be separated from the business, but infrastructure development more often means planning for the long term, and the plan may have nothing to do with any specific business. The results of infrastructure development are therefore sometimes hard to measure.
But as Yan Shoumeng said, “Some things cannot be built quickly once the business asks for them. So we must have our own foresight and do some things ahead of the business rather than simply following its rhythm.” This team, which carries the exploration of advanced technology, will keep moving forward as a vanguard.
Guest introductions:
Wang Xu: senior technical expert in the Trusted Native technology department at Ant Group, member of the Mulan open source community TOC, and co-founder of Kata Containers, a top-level project of the Open Infrastructure Foundation. Before joining Ant Group he was an open source entrepreneur in the secure container field. In 2015 his team open-sourced runV, a container engine based on virtualization technology, and in December 2017 they and Intel jointly announced the merger of runV and Clear Containers into the Kata Containers project.
Song Shun: senior technical expert at Ant Group and member of the Apollo Config PMC. He has extensive experience in microservice architecture, distributed computing, and related fields. He joined Ant Group in 2019 and currently focuses on cloud native and microservices, including Service Mesh, Serverless, and Application Runtime.
Kong Jincan: senior technical expert for hardware-software integration at Ant Group, with extensive experience in the system architecture of data center hardware and software, including compute and storage products, network products, and all-in-one machines. Since joining Ant he has been responsible for hardware-software integration architecture and infrastructure products, including all-in-one products for financial cloud, database, and privacy computing.
Yan Shoumeng: researcher at Ant Group and head of Ant's privacy computing infrastructure. He led development of Ant Group's SOFAEnclave confidential computing software stack (Occlum, HyperEnclave, KubeTEE, etc.) and has initiated and led the drafting of many TEE standards in China and abroad. Before joining Ant he worked on basic technology research at Intel's research institute in China, and a number of his research results were applied in Intel's software and hardware products. He has published multiple papers at top conferences such as PLDI, ASPLOS, ATC, and ASE and holds more than 30 patents. He received his PhD in computer application technology from Northwestern Polytechnical University.
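This article is selected from “Interviews with China's Outstanding Technical Teams” (2022, Season 2). This issue features the practical experience and insights of technical teams including Microsoft Edge, Ant Trusted Native, Mingyuan Cloud (明源云), Wenyin Hulian (文因互联), and Babylon.js on putting technology into production and building teams.
“Interviews with China's Outstanding Technical Teams” is a flagship content product from InfoQ. It is a series of interviews organized around the IT teams of outstanding Chinese companies, intended to share how these teams work and what their technical practices are, so that developers can learn about their accumulated knowledge, technical evolution, product refinement, and team culture, and draw useful insights from them.
This article is shared from the WeChat official account Alipay Technology (Ant-Techfin).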