Large-scale pre-trained models need large-scale benchmarks for verification.
With PaLM, the first model built on the Pathways architecture from Jeff Dean and colleagues, researchers had already tested the large model on a dedicated benchmark called BIG-Bench, among other evaluations. Recently, Google finally made the BIG-Bench paper and GitHub repository public. The researchers say the work took two years to complete; the paper runs to 100 pages with 442 authors, and the number of tasks included in the benchmark has grown from roughly 150 at the time of the PaLM paper to more than 200.

BIG-bench is a new benchmark for evaluating language models across a range of scales, and Google AI lead Jeff Dean has voiced his appreciation for the work. The paper: "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models".
- Paper link: https://arxiv.org/abs/2206.04615
- GitHub: https://github.com/google/BIG-bench
As they continue to scale up, language models demonstrate both quantitative improvements and new qualitative capabilities. Despite their potentially transformative impact, these new capabilities remain poorly characterized. To inform future research and prepare for disruptive new model capabilities, it is important to understand the present and near-future capabilities and limitations of language models. To meet this challenge, Google has introduced the Beyond the Imitation Game Benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors from 132 research institutions. The task topics are diverse, drawing on problems from linguistics, child development, mathematics, commonsense reasoning, biology, physics, social bias, software development, and other areas. BIG-bench focuses on tasks believed to be beyond the capabilities of current language models. Google evaluated OpenAI's GPT series models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, with model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human experts performed all the tasks to provide a stronger baseline. Findings across the various models include:

- Model performance and calibration both improve with scale, but remain poor in absolute terms (compared with rater performance);
- Performance is strikingly similar across model classes, though sparsity brings a performance benefit;
- Tasks that improve gradually and predictably usually involve a large knowledge or memorization component, while tasks that show "breakthrough" behavior at a critical scale often involve multiple steps or brittle metrics;
- In settings with ambiguous context, social bias typically increases with model scale, but it can be improved through prompting.
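To make the setup concrete, here is a minimal sketch (not the official BIG-bench harness) of scoring a JSON-style task by exact-match accuracy. The `model_fn` stub and the file name `task.json` are placeholder assumptions; the field layout follows the general shape of BIG-bench's generative JSON tasks, which store their examples under an `"examples"` key with `"input"` and `"target"` fields.

```python
import json
from typing import Callable, Dict, List


def exact_match_score(task_path: str, model_fn: Callable[[str], str]) -> float:
    """Score a JSON-style task by exact-match accuracy.

    `model_fn` maps a prompt string to the model's text output; here it
    is a stand-in for a real language-model call.
    """
    with open(task_path) as f:
        task: Dict = json.load(f)

    examples: List[Dict] = task["examples"]
    correct = 0
    for ex in examples:
        prediction = model_fn(ex["input"]).strip()
        # Targets may be a single string or a list of acceptable answers.
        targets = ex["target"] if isinstance(ex["target"], list) else [ex["target"]]
        if prediction in (t.strip() for t in targets):
            correct += 1
    return correct / len(examples)


if __name__ == "__main__":
    # Trivial stub model that always answers "yes", for illustration only.
    score = exact_match_score("task.json", lambda prompt: "yes")
    print(f"exact-match accuracy: {score:.3f}")
```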
Figure 1: On BIG-bench, the aggregate performance of many models improves with scale. For now, however, all models perform poorly in absolute terms.
Figure 2: Existing benchmarks are narrow in scope and show rapidly saturating performance.
Figure 3: The diversity and scale of BIG-bench tasks. (a) A word cloud of task keywords. (b) The distribution of task sizes, measured by number of examples.
BIG-bench Lite (BBL) is a small subset of 24 diverse JSON tasks drawn from BIG-bench. It is designed to provide a canonical measure of model performance while being far cheaper to evaluate than the full set of more than 200 programmatic and JSON tasks in BIG-bench. The performance ranking of current models on BBL is shown in the figure above.
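Many of these JSON tasks are multiple-choice, where BIG-bench associates each candidate answer with a score via a `target_scores` map. The sketch below assumes that shape and is purely illustrative: `log_prob_fn` is a stand-in for a model API returning log p(continuation | context), and the toy examples are invented.

```python
import math
from typing import Callable, Dict, List


def multiple_choice_accuracy(
    examples: List[Dict],
    log_prob_fn: Callable[[str, str], float],
) -> float:
    """Score multiple-choice examples by picking the choice to which the
    model assigns the highest conditional log-probability."""
    correct = 0
    for ex in examples:
        # `target_scores` maps each choice to 1 (correct) or 0 (incorrect).
        choices = list(ex["target_scores"].keys())
        best = max(choices, key=lambda c: log_prob_fn(ex["input"], c))
        correct += ex["target_scores"][best]
    return correct / len(examples)


toy_examples = [
    {"input": "2 + 2 =", "target_scores": {"4": 1, "5": 0}},
    {"input": "The sky is", "target_scores": {"blue": 1, "green": 0}},
]
# A toy run with a purely illustrative uniform stub scorer.
print(multiple_choice_accuracy(toy_examples, lambda ctx, cont: math.log(0.5)))
```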
Figure 4: The best (blue) and average (gray) human rater scores for each BIG-bench Lite task, alongside the BIG-bench Lite performance of the best model configuration (maroon). Random performance on multiple-choice tasks is indicated by hatched markers.

Google encourages community participants to continue submitting new tasks, noting that submissions will be reviewed individually and merged into the BIG-bench repository on a rolling basis, and that task authors will be included in the author list of future publications.