当前位置：网站首页>Evaluate:huggingface detailed introduction to the evaluation index module

Evaluate:huggingface detailed introduction to the evaluation index module

2022-06-26 15:24:00 【Brother muyao】

One 、 Introduce

evaluate yes huggingface stay 2022 year 5 A library for evaluating machine learning models and datasets was launched at the end of this month , Need to be python 3.7 And above . There are three types of assessment ：

Metric： It is used to evaluate the model through the predicted value and reference value , In the traditional sense indicators , such as f1、bleu、rouge etc. .
Comparison： The same test set for two （ Multiple ） Model evaluation , For example, the results of the two models match Degree of .
Measurement： be used for evaluation Data sets , For example, the number of words 、 Number of words after de duplication, etc .

Two 、 install

pip install ：

pip install evaluate

Source code installation ：

git clone https://github.com/huggingface/evaluate.git
cd evaluate
pip install -e .

Check that it's installed （ It will output the forecast results Dict）：

python -c "import evaluate; print(evaluate.load('accuracy').compute(references=[1], predictions=[1]))"

3、 ... and 、 Use

3.1 load Method

evaluate Each indicator in is a separate Python modular , adopt evaluate.load()（ Click to view the document ） Function fast load , among load The common parameters of the function are as follows ：

path： Mandatory ,str type . It can refer to the label （ Such as accuracy or The iron juice of the community contributed Of muyaostudio/myeval）, If the source installation can also be a pathname （ Such as ./metrics/rouge or ./metrics/rogue/rouge.py）. I use the latter , Because the evaluation script will be downloaded online when the indicator name is directly transmitted , But the net of the unit is not awesome .
config_name： Optional ,str type . Configuration of indicators （ Such as GLUE Each subset of the metrics has a configuration ）
module_type： One of the three evaluation types mentioned above , Default metric, Optional comparison or measurement
cache_dir： Optional , Path to store temporary predictions and references （ The default is ~/.cache/huggingface/evaluate/)

import evaluate

# module_type  The default is  'metric'
accuracy = evaluate.load("accuracy")

# module_type  Explicitly specify  'metric','comparison','measurement', Prevent duplicate names .
word_length = evaluate.load("word_length", module_type="measurement")

3.2 List available metrics

list_evaluation_modules List official （ And the community ） What are the indicators in , You can also see likes , There are three parameters ：

module_type： The type of evaluation module to list ,None It's all , Optional metric,comparison,measurement.
include_community： Whether the community module is included , Default True .
with_details： Returns the full details of the indicator Dict Information , instead of str Indicator name of type . Default False.

print(evaluate.list_evaluation_modules(
	module_type="measurement", 
	include_community=True, 
	with_details=True)
)

Insert picture description here

3.3 Evaluation modules all have attributes

All evaluation modules come with a set of useful attributes , These attributes are useful for using stored in evaluate.EvaluationModuleInfo Object , Properties are as follows ：

description： Index Introduction
citation：latex reference
features： Input format and type , such as predictions、references etc.
inputs_description： Input parameter description document
homepage： Indicator official website
license： Indicator license
codebase_urls： The index is based on the code address
reference_urls： The reference address of the indicator

Insert picture description here

3.4 Calculate the index value （ One time calculation / Incremental calculation ）

Mode one ： One time calculation
function ：EvaluationModule.compute(), Pass in list/array/tensor And so on references and predictions.

>>> import evaluate
>>> metric_name = './evaluate/metrics/accuracy'
>>> accuracy = evaluate.load(metric_name)
>>> accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])

{
    'accuracy': 0.5}  #  Output results

Mode two ： Incremental calculation of single increment
function ： EvaluationModule.add(), be used for for loop One on one Add... To the field ref and pred, Calculate indicators uniformly after adding and exiting the cycle .

>>> for ref, pred in zip([0,0,0,1], [0,0,0,1]):
...     accuracy.add(references=ref, predictions=pred)
... 
>>> accuracy.compute()

{
    'accuracy': 1.0}  #  Output results

Mode three ： Multi increment incremental calculation
function ： EvaluationModule.add_batch(), be used for for loop Many to many Add... To the field ref and pred（ The following example is an addition 3 Yes ）, Calculate indicators uniformly after adding and exiting the cycle .

>>> for refs, preds in zip([[0,1],[0,1],[0,1]], [[1,0],[0,1],[0,1]]):
...     accuracy.add_batch(references=refs, predictions=preds)
... 
>>> accuracy.compute()

{
    'accuracy': 0.6666666666666666}  #  Output results

3.5 Save the evaluation results

function ：evaluate.save(), Parameter is path_or_file, The path or file used to store the file . If only folders are provided , Then the result file will be displayed in result-%Y%m%d-%H%M%S.json Save in the format of ; Biography dict Keyword parameter of type **result,**params.

>>> result = accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])
>>> hyperparams = {
    "model": "bert-base-uncased"}
>>> evaluate.save("./results/", experiment="run 42", **result, **hyperparams)

Insert picture description here

3.6 Automatic assessment

It's kind of like Trainer Encapsulation , You can directly evaluate.evaluator() For automatic evaluation , And can pass strategy The parameters are adjusted to calculate the confidence interval and standard error , It helps to evaluate the stability of the value ：

from transformers import pipeline
from datasets import load_dataset
from evaluate import evaluator
import evaluate

pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb", device=0)
data = load_dataset("imdb", split="test").shuffle().select(range(1000))
metric = evaluate.load("accuracy")
results = eval.compute(model_or_pipeline=pipe, data=data, metric=metric,
                       label_mapping={
    "NEGATIVE": 0, "POSITIVE": 1},
                       strategy="bootstrap", n_resamples=200)

print(results)
>>> {
    'accuracy': 
...     {
    
...       'confidence_interval': (0.906, 0.9406749892841922),
...       'standard_error': 0.00865213251082787,
...       'score': 0.923
...     }
... }