✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

Overview

Python framework to explore, label, and monitor data for NLP

Usage example · Get started · Quick links · Docs

Example: Named Entity Recognition data exploration and annotation with spaCy and the IMDB dataset

What is Rubrix?

Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

Key features:

  • Open: Rubrix is free, open-source, and 100% compatible with major NLP libraries (Hugging Face transformers, spaCy, Stanford Stanza, Flair, etc.). In fact, you can use and combine your preferred libraries without implementing any specific interface.

  • End-to-end: Most annotation tools treat data collection as a one-off activity at the beginning of each project. In real-world projects, data collection is a key activity of the iterative process of ML model development. Once a model goes into production, you want to monitor and analyze its predictions, and collect more data to improve your model over time. Rubrix is designed to close this gap, enabling you to iterate as much as you need.

  • User and Developer Experience: The key to sustainable NLP solutions is to make it easier for everyone to contribute to projects. Domain experts should feel comfortable interpreting and annotating data. Data scientists should feel free to experiment and iterate. Engineers should feel in control of data pipelines. Rubrix optimizes the experience for these core users to make your teams more productive.

  • Beyond hand-labeling: Classical hand labeling workflows are costly and inefficient, but having humans-in-the-loop is essential. Easily combine hand-labeling with active learning, bulk-labeling, zero-shot models, and weak-supervision in novel data annotation workflows.

Example

Let's see Rubrix in action with a quick example: Bootstrapping data annotation with a zero-shot classifier

Why:

  • The availability of pre-trained language models with zero-shot capabilities means you can sometimes accelerate your data annotation tasks by pre-annotating your corpus with a pre-trained zero-shot model.
  • The same workflow can be applied if there is a pre-trained "supervised" model that fits your categories but needs fine-tuning for your own use case. For example, fine-tuning a sentiment classifier for a very specific type of message.

Ingredients:

  • A zero-shot classifier from the 🤗 Hub: typeform/distilbert-base-uncased-mnli
  • A dataset containing news
  • A set of target categories: Business, Sports, etc.

What are we going to do:

  1. Make predictions and log them into a Rubrix dataset.
  2. Use the Rubrix web app to explore, filter, and annotate some examples.
  3. Load the annotated examples and create a training set, which you can then use to train a supervised classifier.

1. Predict and log

Let's load the zero-shot pipeline and the dataset (we are using the AGNews dataset for demonstration, but this could be your own dataset). Then, let's go over the dataset records and log them using rb.log(). This will create a Rubrix dataset, accessible from the web app.

from transformers import pipeline
from datasets import load_dataset
import rubrix as rb

model = pipeline('zero-shot-classification', model="typeform/distilbert-base-uncased-mnli")

dataset = load_dataset("ag_news", split='test[0:100]')

labels = ['World', 'Sports', 'Business', 'Sci/Tech']

for item in dataset:
    prediction = model(item['text'], labels)

    record = rb.TextClassificationRecord(
        inputs=item["text"],
        prediction=list(zip(prediction['labels'], prediction['scores']))
    )

    rb.log(record, name="news_zeroshot")
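
The prediction argument above is a list of (label, score) pairs. As a minimal sketch, assuming the pipeline output is a dict with parallel labels and scores lists (as used in the loop above), the conversion can be isolated in a small helper. The names to_prediction and top_label are hypothetical and not part of Rubrix, and the scores are hand-written rather than real model output:

```python
from typing import Dict, List, Tuple


def to_prediction(output: Dict[str, list]) -> List[Tuple[str, float]]:
    """Pair each candidate label with its score."""
    return list(zip(output["labels"], output["scores"]))


def top_label(prediction: List[Tuple[str, float]]) -> str:
    """Return the label with the highest score."""
    return max(prediction, key=lambda pair: pair[1])[0]


# Hand-written stand-in for a zero-shot pipeline output (not real model scores).
example_output = {
    "sequence": "Stocks rallied after the earnings report.",
    "labels": ["Business", "World", "Sci/Tech", "Sports"],
    "scores": [0.7, 0.15, 0.1, 0.05],
}

prediction = to_prediction(example_output)
print(top_label(prediction))
```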

2. Explore, Filter and Label

Now let's access our Rubrix dataset and start annotating data. Let's filter the records predicted as Business with high probability and use the bulk-labeling feature to label 15 records as Business (video: Zeroshot.Example.mp4).
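
The filter used in this step can be thought of as a predicate over each record's prediction. The sketch below is plain Python over hand-written records, not the web app's implementation. The 0.8 threshold and the helper name predicted_as are illustrative assumptions:

```python
from typing import List, Tuple

Prediction = List[Tuple[str, float]]


def predicted_as(prediction: Prediction, label: str, min_score: float) -> bool:
    """True when `label` is the top-scoring label and its score is at least `min_score`."""
    best_label, best_score = max(prediction, key=lambda pair: pair[1])
    return best_label == label and best_score >= min_score


# Hand-written predictions for illustration only.
records = [
    {"text": "Oil prices climb again", "prediction": [("Business", 0.91), ("World", 0.09)]},
    {"text": "Cup final goes to penalties", "prediction": [("Sports", 0.88), ("Business", 0.12)]},
    {"text": "Markets mixed at the close", "prediction": [("Business", 0.55), ("World", 0.45)]},
]

candidates = [r for r in records if predicted_as(r["prediction"], "Business", 0.8)]
print([r["text"] for r in candidates])
```

Records that pass such a filter are the natural candidates for bulk-labeling, since the model is already fairly sure about them.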

3. Load and create a training set

After a few iterations of data annotation, we can load the Rubrix dataset and create a training set to train or fine-tune a supervised model.

import pandas as pd
import rubrix as rb

# load the Rubrix dataset as a pandas DataFrame
rb_df = rb.load(name='news_zeroshot')

# keep only the records that have been annotated (validated)
rb_df = rb_df[rb_df.status == "Validated"]

# select text input and the annotated label
train_df = pd.DataFrame({
    "text": rb_df.inputs.transform(lambda r: r["text"]),
    "label": rb_df.annotation,
})
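
Most training libraries expect integer class ids rather than label strings. A minimal sketch of that mapping, in plain Python with made-up rows standing in for train_df (the names label2id and id2label are a common convention, not something Rubrix provides):

```python
# Made-up (text, label) rows standing in for the annotated records.
rows = [
    ("Oil prices climb again", "Business"),
    ("Cup final goes to penalties", "Sports"),
    ("Markets mixed at the close", "Business"),
]

# Sort the label set so that the mapping is stable across runs.
label2id = {label: idx for idx, label in enumerate(sorted({label for _, label in rows}))}
id2label = {idx: label for label, idx in label2id.items()}

encoded = [(text, label2id[label]) for text, label in rows]
print(label2id)
```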

Architecture

Rubrix's main components are:

  • Rubrix Python client: Python client to log, load, copy and delete Rubrix datasets.
  • Rubrix server: FastAPI REST service for reading and writing data.
  • Elasticsearch: The storage layer and search engine powering the API and the web app.
  • Rubrix web app: Easy-to-use web application for data exploration and annotation.

Quick links

| Doc | Description |
| --- | --- |
| 🚶 First steps | New to Rubrix and want to get started? |
| 👩‍🏫 Concepts | Want to know more about Rubrix concepts? |
| 🛠️ Setup and install | How to configure and install Rubrix |
| 🗒️ Tasks | What can you use Rubrix for? |
| 📱 Web app reference | How to use the web-app for data exploration and annotation |
| 🐍 Python client API | How to use the Python classes and methods |
| 👩‍🍳 Rubrix cookbook | How to use Rubrix with your favourite libraries (flair, stanza...) |
| 👋 Community forum | Ask questions, share feedback, ideas and suggestions |
| 🤗 Hugging Face tutorial | Using Rubrix with 🤗 transformers and datasets |
| 💫 spaCy tutorial | Using spaCy with Rubrix for NER projects |
| 🐠 Weak supervision tutorial | How to leverage weak supervision with snorkel & Rubrix |
| 🤔 Active learning tutorial | How to use active learning with modAL & Rubrix |
| 🧪 Knowledge graph tutorial | How to use Rubrix with kglab & pytorch_geometric |

Get started

To get started you need to follow three steps:

  1. Install the Python client
  2. Launch the web app
  3. Start logging data

1. Install the Python client

You can install the Python client with pip:

pip install rubrix

2. Launch the web app

There are two ways to launch the web app:

  • a) Using docker-compose (recommended).
  • b) Executing the server code manually

a) Using docker-compose (recommended)

Create a folder:

mkdir rubrix && cd rubrix

and launch the docker-contained web app with the following command:

wget -O docker-compose.yml https://git.io/rb-docker && docker-compose up

This is the recommended way because it automatically includes an Elasticsearch instance, Rubrix's main persistence layer.

b) Executing the server code manually

When executing the server code manually you need to provide an Elasticsearch instance yourself.

  1. First you need to install Elasticsearch (we recommend version 7.10) and launch an Elasticsearch instance. For macOS and Windows there are Homebrew formulae and an MSI package, respectively.
  2. Install the Python client together with its server dependencies:
pip install rubrix[server]
  3. Launch a local instance of the web app:
python -m rubrix.server

By default, the Rubrix server will look for your Elasticsearch endpoint at http://localhost:9200. You can customize this by setting the ELASTICSEARCH environment variable.
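
A minimal sketch of the lookup described above, assuming only what the paragraph states (the variable name ELASTICSEARCH and the default URL). The helper resolve_elasticsearch_url and the host es.internal are hypothetical, and this is not the server's actual code:

```python
import os
from typing import Mapping, Optional

DEFAULT_ELASTICSEARCH = "http://localhost:9200"


def resolve_elasticsearch_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the ELASTICSEARCH value if set and non-empty, else the default."""
    env = os.environ if env is None else env
    return env.get("ELASTICSEARCH") or DEFAULT_ELASTICSEARCH


print(resolve_elasticsearch_url({}))
print(resolve_elasticsearch_url({"ELASTICSEARCH": "http://es.internal:9200"}))
```

In practice you would set the variable in the shell (or in your docker-compose file) before running python -m rubrix.server.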

3. Start logging data

The following code will log one record into a dataset called example-dataset:

import rubrix as rb

rb.log(
    rb.TextClassificationRecord(inputs="My first Rubrix example"),
    name='example-dataset'
)

If you go to your Rubrix web app at http://localhost:6900/, you should see your first dataset. The default username and password are rubrix and 1234. You can also check the REST API docs at http://localhost:6900/api/docs.

Congratulations! You are ready to start working with Rubrix.

To better understand what's possible, take a look at Rubrix's Cookbook.

Community

As a new open-source project, we are eager to hear your thoughts, fix bugs, and help you get started. Feel free to use the Discussion forum or the Issues and we'll be pleased to help out.

Comments
  • Add monitoring examples with FastAPI: Hugging Face and spaCy

    The idea would be to add a guide (as a Jupyter Notebook) to be included under docs/guides. This Jupyter notebook will showcase the RubrixLogHTTPMiddleware for monitoring the predictions of a FastAPI inference endpoint. Here is the example with Hugging Face + FastAPI:

    from fastapi import FastAPI
    from typing import List
    from transformers import pipeline
    from rubrix.client.asgi import RubrixLogHTTPMiddleware
    
    classifier = pipeline("sentiment-analysis", return_all_scores=True)
    
    app = FastAPI()
    
    # define the middleware for logging predictions into a Rubrix Dataset
    app.add_middleware(
        RubrixLogHTTPMiddleware,
        api_endpoint="/predict",
        dataset="monitoring_dataset_v1",
        # you could post-process the predict output with a custom record_mapper function
        # record_mapper=custom_text_classification_mapper,
    )
    
    # prediction endpoint
    @app.post("/predict")
    def predict_batch(batch: List[str]):
        predictions = classifier(batch)
        return [
            {
                "labels": [p["label"] for p in prediction],
                "probabilities": [p["score"] for p in prediction],
            }
            for prediction in predictions
        ]
    

    The steps would be to:

    1. Create a notebook and include the above example
    2. Add an example with a pre-trained transformer TokenClassifier (for example: https://huggingface.co/dslim/bert-base-NER)
    3. Add an example with a spaCy NER pipeline.
    4. (Optionally) Include an example dashboard with Kibana (screenshots, gif or video)
    5. (Optionally) Include an example with ray serve
    documentation good first issue help wanted 
    opened by dvsrepo 19
  • updated readme with `conda` install instruction

    This closes #781.

    • [x] added conda installation instruction (rubrix is available on conda-forge channel)
    • [x] added badges:
      • [x] conda-forge/rubrix (with version)
      • [x] conda-forge/rubrix (with platform specification): example -- "noarch"
      • [x] docs badge
    opened by sugatoray 14
  • [NER Fine tuning] content selection

    Multi word

    Current state (see screenshot): 1. I select several words; the highlight is grey and forms a solid block (one highlight for all words). 2. When the selection is done, the highlight is split and the label selector appears.

    • [x] Should be:

    1. I select several words; the highlight is grey and split (one highlight per word). 2. When the selection is done, the highlight becomes a solid block and the label selector appears.

    Delete labelling

    • [x] Make the whole tooltip clickable to delete

    Selection on a searched word

    • [x] Selection highlight should not be cut (SS)
    • [x] When the selection contains a searched word, the label selector does not appear (it currently works only when selecting from right to left)
    • [x] In general: change the appearance of search results by showing the matching text in bold instead of an orange highlight

    Cursor

    • [x] Activate the "hand" cursor (pointer) on pieces of text that are already annotated/predicted
    • [x] Activate the "text select" cursor on the rest of the record
    • [x] Enlarge the hover state to the whole area: (record + annotated tooltip + empty space between them) (record + predicted tooltip + empty space between them)

    New Select label modal

    • [x] Integrate new UI modal
    • [x] If there is only one label, don't show the modal; just assign the label after the text is selected
    • [x] Add logic to show the last used label first and preselected
    • [x] Add the following keyboard shortcuts: Enter to validate the preselected label, and the vertical arrow keys or a number to validate other labels
    opened by Amelie-V 13
  • Add text2text example (e.g., text summarisation)

    Add the text summarisation fine-tuning tutorial similar to sentiment classifier fine-tuning tutorial:

    https://rubrix.readthedocs.io/en/stable/tutorials/06-labeling-finetuning.html#3.-Fine-tune-the-pre-trained-mode

    documentation good first issue help wanted 
    opened by frascuchon 13
  • fix: Compute predicted properly for token classification [NEEDS_DATA_UPGRADE]

    This PR fixes the way predicted ok/ko info is computed for token classification records.

    To apply this fix to already created datasets, you must first re-log records. Otherwise, stored info won't be updated.

    Closes #1955

    opened by frascuchon 12
  • [Workspaces] Users without personal datasets

    Should users who have no personal datasets, but belong to one or more workspaces that have datasets, be switched automatically to one of those workspaces?

    Or would it be better to show the datasets from all workspaces in the datasets list and allow filtering by workspace?

    question app 
    opened by frascuchon 11
  • [Text Class] Optimize Long records view *Prioritary*

    • [x] Show labels buttons area above the fold.

    • [x] Create Action to open/close on click the full record in the same view

    • [x] Copy "Show full record" "Show less"

    • [ ] I would grab the opportunity to update the "View more" / "View less" copy in the Metrics modal to "Show more" / "Show less" and apply the same style there

    enhancement 
    opened by Amelie-V 11
  • [Search] Improve and normalize the search data model

    Things to keep in mind:

    • Normalize text inputs fields: text, inputs, words must be normalized and use a common pattern for all tasks
    • Several es analyzers for text fields: standard and whitespace(?) for fine tuning searches. Default as standard
    • What about text fields in metadata? For now, only terms queries are supported. This means that metadata fields with large content cannot be queried with full-text search.
    • Created indices should contain mapping info only for its fields. A text classification index should not include mapping info for tokens or text predicted (text2text).
    • Review filter fields and align with UI names (if any)
    • What about nested fields, like token or metrics info for token classification, or a label and its score for text classification? By default, the query string DSL does not support nested queries, but it would be nice to include some minimal support for that kind of query.

    @dvsrepo @dcfidalgo Anything to include here?

    Tasks

    To get this work done, we need to tackle the following tasks (which will be created as separate issues and linked here):

    1. [Datasets] Avoid using global template for all indices
    2. [Datasets] Dataset migration mechanisms for each release
    3. [Datasets] New es document model per task with backward compatibility fields
    4. [Datasets] Apply migration to new es doc model
    5. [Datasets] Build searches and aggregations using new doc model
    enhancement server 
    opened by frascuchon 11
  • Devise workflow to test the tutorials via a github action

    The idea here is to devise a workflow to test our tutorials in a semi-automatic way. Ideally, we have a workflow that we can launch manually and let's say every two weeks or so, to test our tutorials. Maybe we can use nbmake for this and follow this blogpost. The tricky part is that for some tutorials we need to change/add/delete a few cells to be able to run them in an automated way ...

    documentation good first issue help wanted 
    opened by dcfidalgo 10
  • [Weak supervision] Rules numbers by label

    For instance:

    Sci/tech 2 Sports 1 Business 4 Politics 0 World 0

    This feature could be used for two things:

    • Help to track how the rule definition is going
    • See the full label list (in "define rules" we don't have this list by default)
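
The per-label counts requested above can be sketched with collections.Counter. The rules below are hypothetical (query, label) pairs, made up so that the counts match the example in this issue. This is not Rubrix code:

```python
from collections import Counter

# Hypothetical (query, label) rules; the queries are made up for illustration.
rules = [
    ("nasa OR satellite", "Sci/tech"),
    ("software", "Sci/tech"),
    ("match OR league", "Sports"),
    ("stocks", "Business"),
    ("profit", "Business"),
    ("shares", "Business"),
    ("merger", "Business"),
]
all_labels = ["Sci/tech", "Sports", "Business", "Politics", "World"]

counts = Counter(label for _, label in rules)
# Reporting every known label keeps labels with zero rules visible.
rules_per_label = {label: counts.get(label, 0) for label in all_labels}
print(rules_per_label)
```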
    ui 
    opened by Amelie-V 9
  • Any plan to support no-whitespace language?

    I am planning to use rubrix for Japanese text data. The search functionality doesn't seem to work well on this language. I think it's better if we can customize the tokenizer used in elasticsearch instead of hardcoded "whitespace" tokenizer.
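
To illustrate the problem reported here, the sketch below uses Python's str.split() as a stand-in for a whitespace tokenizer. It is not Elasticsearch, and the sentences are made up. The point is only that text without spaces ends up as a single token, so a search for a word inside it has nothing to match:

```python
english = "I live in Tokyo"
japanese = "私は東京に住んでいます"  # "I live in Tokyo"

# str.split() with no arguments splits on runs of whitespace.
english_tokens = english.split()
japanese_tokens = japanese.split()

print(len(english_tokens))
print(len(japanese_tokens))

# The word for Tokyo is inside the sentence, but it is not a token of its own.
print("東京" in japanese, "東京" in japanese_tokens)
```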

    opened by faisalron 9
  • Use a default vector for vector search, like `TF-IDF`

    Is your feature request related to a problem? Please describe. I do not want to set up anything for vector search but I do want to use it.

    Describe the solution you'd like I would like to see a very straightforward model-agnostic way of using the feature without any specific implementation. DatasetSettings.vectorsearch_tf_idf = True.
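
For context, a model-agnostic TF-IDF vector can be computed without any external model. The sketch below is one common unsmoothed variant (term frequency divided by document length, times log(N / document frequency)), written in plain Python over made-up documents. It only illustrates the idea behind the request; DatasetSettings.vectorsearch_tf_idf is the setting proposed in this issue, not an existing API:

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse {term: weight} vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {term: (count / len(doc)) * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


docs = [
    "the match ended in a draw".split(),
    "the market ended higher".split(),
]
vectors = tf_idf(docs)
print(vectors[0]["match"] > vectors[0]["the"])
```

Terms that occur in every document, such as "the" here, get a weight of zero, while terms specific to one document get a positive weight.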

    Describe alternatives you've considered N.A.

    Additional context N.A.

    enhancement 
    opened by davidberenstein1957 0
  • chore(deps-dev): update fastapi requirement from <0.89,>=0.75 to >=0.75,<0.90

    Updates the requirements on fastapi to permit the latest version.

    Release notes

    Sourced from fastapi's releases.

    0.89.0

    Features

    • ✨ Add support for function return type annotations to declare the response_model. Initial PR #1436 by @​uriyyo.

    Now you can declare the return type / response_model in the function return type annotation:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Item(BaseModel):
        name: str
        price: float

    @app.get("/items/")
    async def read_items() -> list[Item]:
        return [
            Item(name="Portal Gun", price=42.0),
            Item(name="Plumbus", price=32.0),
        ]

    FastAPI will use the return type annotation to perform:

    • Data validation
    • Automatic documentation
      • It could power automatic client generators
    • Data filtering

    Before this version it was only supported via the response_model parameter.

    Read more about it in the new docs: Response Model - Return Type.

    Docs

    Translations

    • 🌐 Add Russian translation for docs/ru/docs/fastapi-people.md. PR #5577 by @​Xewus.
    • 🌐 Fix typo in Chinese translation for docs/zh/docs/benchmarks.md. PR #4269 by @​15027668g.
    • 🌐 Add Korean translation for docs/tutorial/cors.md. PR #3764 by @​NinaHwang.

    ... (truncated)

    Commits
    • 69bd7d8 🔖 Release version 0.89.0
    • a6af7c2 📝 Update release notes
    • aa6a8e5 📝 Update release notes
    • c482dd3 ⬆ Update coverage[toml] requirement from <7.0,>=6.5.0 to >=6.5.0,<8.0 (#5801)
    • 681e5c0 📝 Update release notes
    • eb39b0f 📝 Update release notes
    • 27ce2e2 📝 Add External Link: Authorization on FastAPI with Casbin (#5712)
    • f56b0d5 ⬆ Update uvicorn[standard] requirement from <0.19.0,>=0.12.0 to >=0.12.0,<0.2...
    • 5c6d7b2 📝 Update release notes
    • 78813a5 ✏ Fix typo in docs/en/docs/async.md (#5785)
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • add repr method for Rule, Dataset.

    Description

    Please include a summary of the changes and the related issue. Please also include relevant motivation and context. List any dependencies that are required for this change.

    Closes #2046

    Type of change

    Please delete options that are not relevant.

    • [ ] New feature (non-breaking change which adds functionality)

    How Has This Been Tested

    Please describe the tests that you ran to verify your changes. And ideally reference tests.

    import argilla as rg
    from argilla.labeling.text_classification.rule import Rule
    
    plz = Rule(query="plz OR please", label="SPAM")
    print(repr(plz))
    >>> Rule(query='plz OR please', label='SPAM', name='plz OR please')
    
    
    records = [
            rg.TextClassificationRecord(text="example"),
            rg.TextClassificationRecord(text="another example"),
            rg.TextClassificationRecord(text="another example another example another example another example another example another example"),
        ]
    dataset = rg.DatasetForTextClassification(records=records)
    print(dataset)
    >>>
        	text                          	annotation	prediction
    0   	example                       	None      	None      
    1   	another example               	None      	None      
    2   	another example another exampl	None      	None      
    ...
    3 TextClassificationRecord records
    
    

    Checklist

    • [x] I have merged the original branch into my forked branch
    • [x] I added relevant documentation
    • [x] follows the style guidelines of this project
    • [x] I did a self-review of my code
    • [x] I added comments to my code
    • [x] I made corresponding changes to the documentation
    • [x] My changes generate no new warnings
    • [x] I have added tests that prove my fix is effective or that my feature works
    opened by Ankush-Chander 1
  • feat(Client): RecordTextClassification pass only the necessary data instead of all the dataset

    Description

    ref : #2142 Instead of passing all the dataset, only the necessary data is passed through props into the RecordTextClassification component

    WARNING: merge only after #2145 and #2143 have been merged

    Closes #(issue_number)

    Type of change

    Please delete options that are not relevant.

    • [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)

    How Has This Been Tested

    Please describe the tests that you ran to verify your changes. And ideally reference tests.

    • [x] multilabel
    • [x] singlelabel

    Checklist

    • [x] I have merged the original branch into my forked branch
    • [x] I added relevant documentation
    • [x] follows the style guidelines of this project
    • [x] I did a self-review of my code
    • [ ] I added comments to my code
    • [ ] I made corresponding changes to the documentation
    • [x] My changes generate no new warnings
    • [ ] I have added tests that prove my fix is effective or that my feature works
    opened by keithCuniah 0
  • feat(Client): ClassifierExplorationArea.vue pass only the necessary data

    Description

    ref : #2142 Instead of passing all the dataset, only the necessary data is passed through props into the ClassifierExplorationArea.vue

    Type of change

    Please delete options that are not relevant.

    • [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)

    How Has This Been Tested

    Please describe the tests that you ran to verify your changes. And ideally reference tests.

    • [x] multilabel
    • [x] singlelabel

    Checklist

    • [x] I have merged the original branch into my forked branch
    • [x] I added relevant documentation
    • [x] follows the style guidelines of this project
    • [x] I did a self-review of my code
    • [ ] I added comments to my code
    • [ ] I made corresponding changes to the documentation
    • [x] My changes generate no new warnings
    • [ ] I have added tests that prove my fix is effective or that my feature works
    opened by keithCuniah 0
Releases(v1.1.1)
  • v1.1.1(Nov 29, 2022)

  • v1.1.0(Nov 24, 2022)

    1.1.0 (2022-11-24)

    Highlights

    Add, update, and delete rules from a Dataset using the Python client

    You can now manage rules programmatically and reflect them in Argilla Datasets so you can iterate on labeling rules from both Python and the UI. This is especially useful for leveraging linguistic resources (such as terminological lists) and making the rules available in the UI for domain experts to refine them.

    import pandas as pd
    from argilla.labeling.text_classification import Rule, add_rules

    # Read a file with keywords or phrases
    labeling_rules_df = pd.read_csv("../../_static/datasets/weak_supervision_tutorial/labeling_rules.csv")

    # Create rules
    predefined_labeling_rules = []
    for index, row in labeling_rules_df.iterrows():
        predefined_labeling_rules.append(
            Rule(row["query"], row["label"])
        )

    # Add the rules to the weak_supervision_yt dataset. The rules will be manageable from the UI
    add_rules(dataset="weak_supervision_yt", rules=predefined_labeling_rules)

    You can find more info about this feature in the deep dive guide: https://docs.argilla.io/en/latest/guides/techniques/weak_supervision.html#3.-Building-and-analyzing-weak-labels

    Sort by timestamp fields in the UI

    Users can now sort records by last_updated and other timestamp fields to improve the labeling and review processes.

    Features

    • #1929 add warning about using wrong hostnames (#1930) (a3bc554)
    • Add, delete and edit labeling rules from Python client (#1884) (d534a29), closes #1855
    • Added more explicit error message regarding dataset name validation (#1933) (c25a225), closes #1931 #1918
    • Allow sort records by event_timestamp or last_updated fields (#1924) (1c08c36), closes #1835
    • Create a contextual help to support the user in the different dataset views (#1913) (8e3851e)
    • Enable metadata length field config by environment variable (#1923) (0ff2de7), closes #1761
    • Update error page (#1932) (caeb7d4), closes #1894
    • Using new top_k_mentions metrics instead of entity_consistency (#1880) (42f702d), closes #1834

    Bug Fixes

    Documentation

    As always, thanks to our amazing contributors!

    • docs: Link key features (#1805) (#1809) by @chschroeder
    • View Docs link in frontend header users.vue (#1915) by @bengsoon
    • fix: Change method for Doc creation by spacy.Language (#1891) by @jamnicki
  • v1.0.1(Nov 4, 2022)

  • v0.19.0(Oct 24, 2022)

  • v0.18.0(Oct 5, 2022)

    0.18.0 (2022-10-05)

    ⚡ Highlights

    Better validation of token classification records

    When working with Token Classification records, there are very often misalignment problems between the entity spans and provided tokens. Before this release, it was difficult to understand and fix these errors because validation happened on the server side.

    With this release, records are validated during instantiation, giving you a clear error message which can help you to fix/ignore problematic records.

    For example, the following record:

    import rubrix as rb
    
    rb.TokenClassificationRecord(
        tokens=["I", "love", "Paris"],
        text="I love Paris!",
        prediction=[("LOC",7,13)]
    )
    

    Will give you the following error message:

    ValueError: Following entity spans are not aligned with provided tokenization
    Spans:
    - [Paris!] defined in ...love Paris!
    Tokens:
    ['I', 'love', 'Paris']
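
To make the alignment rule concrete, here is a minimal sketch of such a check, assuming that a span is aligned when it starts at the start of a token and ends at the end of a token. The helpers token_offsets and is_aligned are hypothetical and are not the validation code used by Rubrix:

```python
from typing import List, Tuple


def token_offsets(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Locate each token in `text`, scanning left to right."""
    offsets, cursor = [], 0
    for token in tokens:
        start = text.index(token, cursor)  # raises ValueError if the token is absent
        cursor = start + len(token)
        offsets.append((start, cursor))
    return offsets


def is_aligned(text: str, tokens: List[str], start: int, end: int) -> bool:
    """True when the span starts at a token start and ends at a token end."""
    offsets = token_offsets(text, tokens)
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return start in starts and end in ends


text = "I love Paris!"
tokens = ["I", "love", "Paris"]
print(is_aligned(text, tokens, 7, 13))  # span covers "Paris!"
print(is_aligned(text, tokens, 7, 12))  # span covers "Paris"
```

With the tokens above, the span (7, 13) includes the trailing "!" that is not part of any token, which is why the record is rejected. Ending the span at 12 would line it up with the "Paris" token.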
    

    Delete records by query

    Now it's possible to delete specific records, either by ids or by a query using Lucene's syntax. This is useful for clean up and better dataset maintenance:

    import rubrix as rb
    
    ## Delete by id
    rb.delete_records(name="example-dataset", ids=[1,3,5])
    
    ## Discard records by query
    rb.delete_records(name="example-dataset", query="metadata.code:33", discard_only=True)
    

    New tutorials

    We have two new tutorials!

    Few-shot classification with SetFit and a custom dataset: https://rubrix.readthedocs.io/en/stable/tutorials/few-shot-classification-with-setfit.html

    Analyzing predictions with model explainability methods: https://rubrix.readthedocs.io/en/stable/tutorials/nlp_model_explainability.html

    Features

    Bug Fixes

    Visual enhancements

    Documentation

    • Add interpret tutorial with Transformers (#1728) (c3fa079), closes #1729
    • Adds tutorial about custom few-shot classification with SetFit (#1739) (4f15ee6), closes #1741
    • fixing the active learning tutorial with small-text (#1726) (909efdf), closes #1693
    • raise small-text version to 1.1.0 and adapt tutorial (#1744) (16f19b7), closes #1693
    • Resolve many typos in documentation, comments and tutorials (#1701) (f05e1c1)
    • using official token class. mapper since is compatible now (#1738) (e82fd13), closes #482

    As always, thanks to our amazing contributors!

    • refactor: accept flat text as input for token classification mapper (#1686) by @Ankush-Chander
    • feat(Client): improve httpx errors handling (#1662) by @Ankush-Chander
    • fix: 'MajorityVoter.score' when using multi-labels (#1678) by @dcfidalgo
    • docs: raise small-text version to 1.1.0 and adapt tutorial (#1744) by @chschroeder
    • refactor: Incompatible attribute type fixed (#1675) by @luca-digrazia
    • docs: Resolve many typos in documentation, comments and tutorials (#1701) by @tomaarsen
    • refactor: Collection of changes, primarily regarding test suite and its coverage (#1702) by @tomaarsen
  • v0.17.0(Aug 22, 2022)

    0.17.0 (2022-08-22)

    ⚡ Highlights

    Preparing a training set in the spaCy DocBin format

    prepare_for_training is a method that prepares a dataset for training. Previously, prepare_for_training only prepared the data for training Hugging Face Transformers models.

    Now, you can prepare your training data for spaCy NER pipelines, thanks to our great community contributor @ignacioct !

    With the example below, you can export your Rubrix dataset into a DocBin, save it to disk, and then use it with the spacy train command.

    import spacy
    import rubrix as rb

    # Load the annotated dataset from Rubrix
    rb_dataset = rb.load("ner_dataset")

    # Use a blank spaCy language model to create the DocBin, as it is faster
    nlp = spacy.blank("en")

    # Export to DocBin and write the file to disk
    rb_dataset.prepare_for_training(framework="spacy", lang=nlp).to_disk("train.spacy")
    

    You can find a full example at: https://rubrix.readthedocs.io/en/v0.17.0/guides/cookbook.html#Train-a-spaCy-model-by-exporting-to-Docbin

    Load large datasets using batches

    Before this release, the rb.load method to read datasets from Python retrieved the full dataset. For large datasets, this could cause high memory consumption, network timeouts, and the inability to read datasets larger than the available memory.

    Thanks to the awesome work by @maxserras, it's now possible to optimize memory consumption and avoid network timeouts when working with large datasets. To that end, you can iterate over the whole dataset in batches using the id_from parameter of the rb.load method.

    An example of reading the first 1000 records and the next batch of up to 1000 records:

    import rubrix as rb
    dataset_batch_1 = rb.load(name="example-dataset", limit=1000)
    dataset_batch_2 = rb.load(name="example-dataset", limit=1000, id_from=dataset_batch_1[-1].id)
    

    The reference to the rb.load method can be found at: https://rubrix.readthedocs.io/en/v0.17.0/reference/python/python_client.html#rubrix.load
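
    The two calls above generalize to a loop over the whole dataset. The following is a minimal sketch, not part of the Rubrix API: iter_batches and its load_fn parameter are hypothetical names, and the loop assumes that load_fn behaves like the rb.load calls shown above (it accepts name, limit, and id_from, returns the records that follow the given id, and supports len() and indexing on the result).

```python
from typing import Any, Callable, Iterator, List, Optional


def iter_batches(
    load_fn: Callable[..., List[Any]],
    name: str,
    batch_size: int = 1000,
) -> Iterator[List[Any]]:
    """Yield successive batches of records until the dataset is exhausted.

    `load_fn` is assumed to accept `name`, `limit` and `id_from`,
    like the `rb.load` calls in the release note.
    """
    last_id: Optional[Any] = None
    while True:
        kwargs = {"name": name, "limit": batch_size}
        if last_id is not None:
            # Continue after the last record of the previous batch
            kwargs["id_from"] = last_id
        batch = load_fn(**kwargs)
        if len(batch) == 0:
            return
        yield batch
        if len(batch) < batch_size:
            # A short batch means there is nothing left to read
            return
        last_id = batch[-1].id
```

    Under these assumptions, a call such as iter_batches(rb.load, "example-dataset") would keep at most one batch in memory at a time.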

    Larger pagination sizes for faster bulk review and annotation

    When using filters and search for data annotation and review, some users are able to filter and quickly review dozens of records in one go. To serve those users, it's now possible to see and bulk annotate 50 or 100 records on each page.


    Copy record text to clipboard

    Sometimes it is useful to copy the text in records to inspect it or process it with another application. Now, this is possible thanks to the feature request by our great community member and contributor @Ankush-Chander !


    Better error logging for generic errors

    Thanks to work done by @Ankush-Chander and @frascuchon we now have more meaningful messages for generic server errors!

    Features

    • Add new pagination size ranges (#1667) (5b4f1f2), closes #1578
    • Allow rb.load fetch records in batches passing the from_id argument (3e6344a)
    • Copy to clipboard the record text (#1625) (d634a7b), closes #1616
    • Error Logging: send error detail in response for generic server errors (#1648) (ad17631)
    • Listeners: allow using query params in the condition through search parameter (#1627) (a0a245d), closes #1622
    • prepare_for_training supports spacy (#1635) (8587808)

    Bug Fixes

    Documentation

    Visual enhancements

    You can see all work included in the release here

    • fix: Update progress bar when refreshing after adding new records (#1666) by @leiyre
    • chore: configure miniconda for readthedocs builder by @frascuchon
    • style: Small visual adjustments for Text2Text record card (#1632) by @leiyre
    • feat: Copy to clipboard the record text (#1625) by @leiyre
    • docs: Add Slack support link in README's get started (#1688) by @dvsrepo
    • chore: update version by @frascuchon
    • feat: Add new pagination size ranges (#1667) by @leiyre
    • fix: handle stream api connection errors gracefully (#1636) by @Ankush-Chander
    • feat: allow rb.load fetch records in batches passing the from_id argument by @maxserras
    • fix(Client): reusing the inner httpx client (#1640) by @frascuchon
    • feat(Error Logging): send error detail in response for generic server errors (#1648) by @frascuchon
    • docs: spacy DocBin cookbook (#1642) by @ignacioct
    • feat: prepare_for_training supports spacy (#1635) by @frascuchon
    • style: Improve card spacing (#1638) by @leiyre
    • docs: Adding Elasticsearch persistence to docker compose section (#1643) by @maxserras
    • chore: remove old rubrix client class (#1639) by @frascuchon
    • feat(Listeners): allow using query params in the condition through search parameter (#1627) by @frascuchon
    • doc: show metric graphs in documentation (#1669) by @leiyre
    • fix(docker-compose.yaml): default volume and disable disk threshold (#1656) by @frascuchon
    • fix: Encode rule name in Weak Labeling API requests (#1649) by @leiyre
  • v0.16.1(Jul 22, 2022)

    0.16.1 (2022-07-22)

    Bug Fixes

    • 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) (3cb4c07), closes #1631
    • Display metadata in Text2Text dataset (#1626) (0089e0a), closes #1623
    • Show predicted OK/KO when predictions exist (#1620) (ef66e9c), closes #1619

    Documentation

    You can see all work included in the release here

    • fix: 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) by @dcfidalgo
    • fix: Display metadata in Text2Text dataset (#1626) by @leiyre
    • chore: set version by @dcfidalgo
    • docs: Fix typo in Getting Started -> Concepts (#1618) by @dcfidalgo
    • fix: Show predicted OK/KO when predictions exist (#1620) by @leiyre
  • v0.16.0(Jul 8, 2022)

    0.16.0 (2022-07-08)

    Highlights

    👂 Listeners: enable more interactive workflows between client and server

    Listeners enable you to define functions that get executed under certain conditions when something changes in a dataset. There are many use cases for this: monitoring annotation jobs, monitoring model predictions, enabling active learning workflows, and many more.

    You can find the Python API reference docs here: https://rubrix.readthedocs.io/en/stable/reference/python/python_listeners.html#python-listeners

    We will be documenting these use cases with practical examples, but for this release, we've included a new tutorial for using this with active learning: https://rubrix.readthedocs.io/en/stable/tutorials/active_learning_with_small_text.html. This tutorial includes the following listener function, which implements the active learning loop:

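    import numpy as np
    import rubrix as rb
    from rubrix.listeners import listener
    from sklearn.metrics import accuracy_score
    
    # Define some helper variables (trec, DATASET_NAME, NUM_SAMPLES, active_learner,
    # and dataset_test come from earlier steps of the tutorial)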
    LABEL2INT = trec["train"].features["label-coarse"].str2int
    ACCURACIES = []
    
    # Set up the active learning loop with the listener decorator
    @listener(
        dataset=DATASET_NAME,
        query="status:Validated AND metadata.batch_id:{batch_id}",
        condition=lambda search: search.total==NUM_SAMPLES,
        execution_interval_in_seconds=3,
        batch_id=0
    )
    def active_learning_loop(records, ctx):
    
        # 1. Update active learner
        print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
        y = np.array([LABEL2INT(rec.annotation) for rec in records])
    
        # initial update
        if ctx.query_params["batch_id"] == 0:
            indices = np.array([rec.id for rec in records])
            active_learner.initialize_data(indices, y)
        # update with the prior queried indices
        else:
            active_learner.update(y)
        print("Done!")
    
        # 2. Query active learner
        print("Querying new data points ...")
        queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
        ctx.query_params["batch_id"] += 1
        new_records = [
            rb.TextClassificationRecord(
                text=trec["train"]["text"][idx],
                metadata={"batch_id": ctx.query_params["batch_id"]},
                id=idx,
            )
            for idx in queried_indices
        ]
    
        # 3. Log the batch to Rubrix
        rb.log(new_records, DATASET_NAME)
    
        # 4. Evaluate current classifier on the test set
        print("Evaluating current classifier ...")
        accuracy = accuracy_score(
            dataset_test.y,
            active_learner.classifier.predict(dataset_test),
        )
        ACCURACIES.append(accuracy)
        print("Done!")
    
        print("Waiting for annotations ...")
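
    Conceptually, a listener polls the dataset at a fixed interval, evaluates a condition on the search result, and calls the decorated function when the condition holds. The sketch below illustrates only that idea with the standard library. It is not how Rubrix implements listeners, and poll_until, search_fn, and max_polls are hypothetical names introduced for the example.

```python
import time
from typing import Any, Callable, Dict, List


def poll_until(
    search_fn: Callable[[str], List[Any]],
    query_template: str,
    query_params: Dict[str, Any],
    condition: Callable[[List[Any]], bool],
    action: Callable[[List[Any], Dict[str, Any]], None],
    interval_in_seconds: float = 3.0,
    max_polls: int = 10,
) -> int:
    """Run `action` each time `condition` holds for the current query.

    Returns how many times `action` was executed.
    """
    executions = 0
    for _ in range(max_polls):
        # Fill placeholders such as {batch_id} with the current parameters
        query = query_template.format(**query_params)
        records = search_fn(query)
        if condition(records):
            # The action may update query_params, e.g. increment batch_id
            action(records, query_params)
            executions += 1
        time.sleep(interval_in_seconds)
    return executions
```

    Because the action receives the mutable query parameters, incrementing batch_id inside it changes the query used on the next poll, which mirrors how the loop above moves from one batch to the next.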
    

    📖 New docs!

    https://rubrix.readthedocs.io/


    🧱 extend_matrix: Weak label augmentation using embeddings

    This release includes an exciting feature to augment the coverage of your weak labels using embeddings. You can find a practical tutorial here: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html

    Features

    Bug Fixes

    Documentation

    • #1512: change theme to furo (#1564, #1604) (98869d2), closes #1512
    • add 'how to prepare your data for training' to basics (#1589) (a21bcf3)
    • add active learning with small text and listener tutorial (#1585, #1609) (d59573f), closes #1601 #421
    • Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) (ab481c7)
    • add pip version and dockertag as parameter in the build process (#1560) (73a31e2)

    You can see all work included in the release here

    • chore(docs): remove by @frascuchon
    • docs: add active learning with small text and listener tutorial (#1585, #1609) by @dcfidalgo
    • docs(#1512): change theme to furo (#1564, #1604) by @frascuchon
    • chore: set version by @frascuchon
    • feat(token-class): adjust token spans spaces (#1599) by @frascuchon
    • feat(#1602): new rubrix dataset listeners (#1507, #1586, #1583, #1596) by @frascuchon
    • docs: add 'how to prepare your data for training' to basics (#1589) by @dcfidalgo
    • test: configure numpy to disable multi threading (#1593) by @frascuchon
    • docs: Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) by @dcfidalgo
    • feat(#1561): standardize icons (#1565) by @leiyre
    • Feat: Improve from datasets (#1567) by @dcfidalgo
    • feat: Add 'extend_matrix' to the WeakMultiLabel class (#1577) by @dcfidalgo
    • docs: add pip version and dockertag as parameter in the build process (#1560) by @frascuchon
    • refactor: remove words references in searches (#1571) by @frascuchon
    • ci: check conda env cache (#1570) by @frascuchon
    • fix(#1264): discard first space after a token (#1591) by @frascuchon
    • ci(package): regenerate view snapshot (#1600) by @frascuchon
    • fix(#1574): search highlighting for a single dot (#1592) by @leiyre
    • fix(#1575): show predicted ok/ko in Text Classifier explore mode (#1576) by @leiyre
    • fix(#1548): access datasets for superusers when workspace is not provided (#1572, #1608) by @frascuchon
    • fix(#1551): don't show error traces for EntityNotFoundError's (#1569) by @frascuchon
    • fix: compatibility with new dataset version (#1566) by @dcfidalgo
    • fix(#1557): allow text editing when clicking the "edit" button (#1558) by @leiyre
    • fix(#1545): highlight words with accents (#1550) by @leiyre
  • v0.15.0(Jun 8, 2022)

    0.15.0 (2022-06-08)

    🔆 Highlights

    🏷️ Configure datasets with a labeling schema

    You can now predefine and change the label schema of your datasets. This is useful for fixing a set of labels for you and your annotation teams.

    import rubrix as rb
    
    # Define labeling schema
    settings = rb.TextClassificationSettings(label_schema=["A", "B", "C"])
    
    # Apply settings to a new or already existing dataset
    rb.configure_dataset(name="my_dataset", settings=settings)
    
    # Logging to the newly created dataset triggers the validation checks
    rb.log(rb.TextClassificationRecord(text="text", annotation="D"), "my_dataset")
    # BadRequestApiError: Rubrix server returned an error with http status: 400
    

    Read the docs: https://rubrix.readthedocs.io/en/stable/guides/dataset_settings.html
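
    The validation triggered above amounts to checking each annotation against the configured labels. The following standard-library sketch illustrates such a check. validate_annotation is a hypothetical helper, not a Rubrix function, and the real check runs on the Rubrix server.

```python
from typing import Iterable, List, Union


def validate_annotation(
    annotation: Union[str, List[str]], label_schema: Iterable[str]
) -> None:
    """Raise ValueError if any annotated label is missing from the schema."""
    allowed = set(label_schema)
    # Accept a single label or a list of labels (multi-label case)
    labels = [annotation] if isinstance(annotation, str) else list(annotation)
    unknown = [label for label in labels if label not in allowed]
    if unknown:
        raise ValueError(
            f"Labels {unknown} are not in the schema {sorted(allowed)}"
        )


schema = ["A", "B", "C"]
validate_annotation("A", schema)  # passes silently
try:
    validate_annotation("D", schema)  # "D" is not part of the schema
except ValueError as error:
    print(error)
```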

    🧱 Weak label matrix augmentation using embeddings

    You can now use an augmentation technique inspired by https://github.com/HazyResearch/epoxy to augment the coverage of your rules using embeddings (e.g., sentence transformers). This is useful for improving the recall of your labeling rules.

    Read the tutorial: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html
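
    The idea behind this augmentation can be sketched without any dependencies: when a rule abstains on a record, it borrows the rule's vote from the most similar record that the rule did label, provided the cosine similarity of their embeddings reaches a threshold. The code below is a simplified illustration under those assumptions, not the Rubrix implementation. extend_votes, the ABSTAIN value of -1, and the toy embeddings are hypothetical.

```python
import math
from typing import List, Sequence

ABSTAIN = -1  # assumed marker for "the rule did not fire on this record"


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extend_votes(
    votes: List[int], embeddings: List[Sequence[float]], threshold: float
) -> List[int]:
    """Extend one rule's votes to similar records where the rule abstained."""
    extended = list(votes)
    labeled = [i for i, vote in enumerate(votes) if vote != ABSTAIN]
    for i, vote in enumerate(votes):
        if vote != ABSTAIN or not labeled:
            continue
        # Most similar record among those the rule originally labeled
        nearest = max(labeled, key=lambda j: cosine(embeddings[i], embeddings[j]))
        if cosine(embeddings[i], embeddings[nearest]) >= threshold:
            extended[i] = votes[nearest]
    return extended


votes = [1, ABSTAIN, ABSTAIN]
embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
print(extend_votes(votes, embeddings, threshold=0.9))
```

    In this toy example the second record is close to the first and inherits its vote, while the third is orthogonal and stays abstained. A higher threshold trades coverage for precision.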

    🏛️ Tutorial Gallery

    Tutorials are now organized into different categories and with a new gallery design!

    Read the docs: https://rubrix.readthedocs.io/en/stable/tutorials/introductory.html

    🏁 Basics guide

    This is the first version of the basics guide. This guide will show you how to perform the most basic actions with Rubrix, such as uploading data or data annotation.

    Read the docs: https://rubrix.readthedocs.io/en/stable/getting_started/basics.html

    Features

    • #1134: Allow extending the weak label matrix with embeddings (#1487) (4d54994), closes #1134
    • #1432: configure datasets with a label schema (21e48c0), closes #1432
    • #1446: copy icon position in datasets list (#1448) (7c9fa52), closes #1446
    • #1460: include text hyphenation (#1469) (ec23b2d), closes #1460
    • #1463: change icon position in table header (#1473) (5172324), closes #1463
    • #1467: include animation delay for last progress bar track (#1462) (c772b74), closes #1467
    • configuration: add elasticsearch ca_cert path variable (#1502) (f0eda12)
    • UI: improve access to actions in metadata and sort dropdowns (#1510) (8d33090), closes #1435

    Bug Fixes

    • #1522: dates metadata fields accessible for sorting (#1529) (a576ceb), closes #1522
    • #1527: check agents instead of labels for predicted computation (#1528) (2f2ee2e), closes #1527
    • #1532: correct domain for filter score histogram (#1540) (7478d6c), closes #1532
    • #1533: restrict highlighted fields (3a8b8a9), closes #1533
    • #1534: fix progress in the metrics sidebar when page is refreshed (#1536) (1b572c4)
    • #1539: checkbox behavior with value 0 (#1541) (7a0ab63), closes #1539
    • metrics: compute f1 for text classification (#1530) (147d38a)
    • search: highlight only textual input fields (8b83a82), closes #1538 #1544

    New contributors

    @RafaelBod made his first contribution in https://github.com/recognai/rubrix/pull/1413

  • v0.14.2(May 31, 2022)

    0.14.2 (2022-05-31)

    Bug Fixes

    • #1514: allow ent score None and change default value to 0.0 (#1521) (0a02c70), closes #1514
    • #1516: restore read-only to copied dataset (#1520) (5b9cf0e), closes #1516
    • #1517: stop background task when something happens to main thread (#1519) (0304f40), closes #1517
    • #1518: disable global actions checkbox when no data was found (#1525) (bf35e72), closes #1518
    • UI: remove selected metadata fields for sortable fields dropdown (#1513) (bb9482b)
  • v0.14.1(May 20, 2022)

    0.14.1 (2022-05-20)

    Bug Fixes

    • #1447: change agent when validating records with annotation but default status (#1480) (126e6f4), closes #1447
    • #1472: hide scrollbar in scrollable components (#1490) (b056e4e), closes #1472
    • #1483: close global actions "Annotate as" selector after deselect records checkbox (#1485) (a88f8cb)
    • #1503: Count filter values when loading a dataset with a route query (#1506) (43be9b8), closes #1503
    • documentation: fix user management guide (#1511) (63f7bee), closes #1501
    • filters: sort filter values by count (#1488) (0987167), closes #1484
  • v0.14.0(May 10, 2022)

    0.14.0 (2022-05-10)

    Async version of rb.log

    You can now use the background parameter of the rb.log method to log records without blocking the main process. The main use case is monitoring predictions in production pipelines. Here's an example with BentoML (you can find the full example in the updated Monitoring guide):

    from bentoml import BentoService, api, artifacts, env
    from bentoml.adapters import JsonInput
    from bentoml.frameworks.spacy import SpacyModelArtifact
    
    import rubrix as rb
    
    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    
    
    @env(infer_pip_packages=True)
    @artifacts([SpacyModelArtifact("nlp")])
    class SpacyNERService(BentoService):
    
        @api(input=JsonInput(), batch=True)
        def predict(self, parsed_json_list):
            result, rb_records = ([], [])
            for index, parsed_json in enumerate(parsed_json_list):
                doc = self.artifacts.nlp(parsed_json["text"])
                prediction = [{"entity": ent.text, "label": ent.label_} for ent in doc.ents]
                rb_records.append(
                    rb.TokenClassificationRecord(
                        text=doc.text,
                        tokens=[t.text for t in doc],
                        prediction=[
                            (ent.label_, ent.start_char, ent.end_char) for ent in doc.ents
                        ],
                    )
                )
                result.append(prediction)
    
            rb.log(
                name="monitor-for-spacy-ner",
                records=rb_records,
                tags={"framework": "bentoml"},
                background=True,
                verbose=False
            )  # With background=True, logging does not add to the model latency
    
            return result
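
    Logging in the background means the request handler hands the records to a worker and returns immediately. The sketch below shows that general pattern with a thread pool from the standard library. It is an illustration under that assumption rather than Rubrix's actual mechanism, and log_in_background and slow_log are hypothetical names.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, List

# A single worker processes batches in submission order
_executor = ThreadPoolExecutor(max_workers=1)


def log_in_background(
    log_fn: Callable[[List[Any]], Any], records: List[Any]
) -> Future:
    """Submit `log_fn(records)` to a worker thread and return at once."""
    # Copy the list so later changes by the caller do not affect the worker
    return _executor.submit(log_fn, list(records))


def slow_log(records: List[Any]) -> int:
    """Stand-in for a slow network call; returns the number of records."""
    time.sleep(0.2)
    return len(records)


start = time.perf_counter()
future = log_in_background(slow_log, ["record-1", "record-2"])
elapsed = time.perf_counter() - start  # time spent before control returns
print(f"returned after {elapsed:.3f}s, logged {future.result()} records")
```

    The caller only pays for the submission. The trade-off is that errors surface later, through the returned future, instead of at the call site.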
    

    Confidence scores in Token Classification (NER)

    When storing entity predictions, you can attach a confidence score as the last element of the entity tuple (label, char_start, char_end, score). Let's see an example:

    import rubrix as rb
    
    text = "Rubrix is a data science tool"
    
    record = rb.TokenClassificationRecord(
        text=text,
        tokens=text.split(" "),
        prediction=[("PRODUCT", 0, 6, 0.99)],
    )
    
    rb.log(record, "ner_with_scores")
    

    Then, in the web application, you and your team can use the score filter to find potentially problematic entities, like in the screenshot below:

    Screenshot 2022-05-12 at 11 49 43

    If you want to see this in action, check this blog post by David Berenstein:

    https://www.rubrix.ml/blog/concise-concepts-rubrix/
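
    Outside the web application, the same tuples can be inspected in plain Python. The following sketch assumes predictions in the (label, char_start, char_end, score) format described above and lists the entities whose score falls below a threshold. low_confidence_entities and the second entity in the sample data are hypothetical.

```python
from typing import List, Tuple

Entity = Tuple[str, int, int, float]  # (label, char_start, char_end, score)


def low_confidence_entities(
    text: str, prediction: List[Entity], threshold: float
) -> List[Tuple[str, str, float]]:
    """Return (entity text, label, score) for entities scoring below threshold."""
    return [
        (text[start:end], label, score)
        for label, start, end, score in prediction
        if score < threshold
    ]


text = "Rubrix is a data science tool"
# The second entity is made up to show a low-scoring prediction
prediction = [("PRODUCT", 0, 6, 0.99), ("FIELD", 12, 24, 0.41)]
print(low_confidence_entities(text, prediction, threshold=0.5))
```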

    Rule metrics sidebar

    We have a fresh new sidebar for the weak labeling mode, where you can see your overall rule metrics as you define new rules.

    This sidebar should help you quickly understand your progress:

    Screenshot 2022-05-12 at 11 52 10

    See the updated user guide here: https://rubrix.readthedocs.io/en/v0.14.0/reference/webapp/define_rules.html

    Features

    Bug Fixes

  • v0.13.3(Apr 27, 2022)

  • v0.13.2(Apr 12, 2022)

    0.13.2 (2022-04-12)

    Bug Fixes

  • v0.13.1(Apr 1, 2022)

  • v0.13.0(Mar 30, 2022)

    0.13.0 (2022-03-30)

    🗂 Multilabel weak supervision

    You can now build multilabel text classification datasets using query-based rules.

    If you want to get started, check out this tutorial.

    https://user-images.githubusercontent.com/1107111/160930404-7b909f1e-b871-4e4c-b1c8-ea9eabfcad21.mp4

    🤗 Reading Hugging Face datasets from the Hub

    You can now read ANY text classification, NER, or text2text dataset directly from the Hub and load it into Rubrix.

    To understand how Rubrix datasets work, check out this guide.


    👥 Redesigned team workspaces

    Organizing teams and datasets is a key Rubrix feature. After several rounds of feedback with early users, we've completely redesigned the user experience. Let us know what you think.


    You can get started and configure users and workspaces by following this guide.

    🔎 Guide for the query language and model

    We have included a new in-depth guide about the Lucene-based query language and data model used for search, weak labeling, loading subsets of data, and metrics.

    Features

    Bug Fixes

  • v0.12.1(Mar 11, 2022)

  • v0.11.1(Mar 11, 2022)

  • v0.12.0(Mar 8, 2022)

    0.12.0 (2022-03-08)

    Features

    Bug Fixes

  • v0.11.0(Feb 20, 2022)

    0.11.0 (2022-02-19)

    Highlights

    Introducing rb.Dataset* and 🤗 Hub integration

    The Dataset classes are lightweight containers for Rubrix records. These classes facilitate importing from and exporting to different formats (e.g., pandas.DataFrame, datasets.Dataset) as well as sharing and versioning Rubrix datasets using the Hugging Face Hub.

    With this release, Rubrix users and teams can use the Hugging Face Hub to share and read both public and private Rubrix datasets for TextClassification, TokenClassification, and Text2Text datasets. This opens up a whole new world of possibilities for data reproducibility and sharing. Let's see an example:

    import rubrix as rb
    from datasets import load_dataset
    
    # 👧🏻 🏷️ Leire has labeled a text classification dataset using a local Rubrix instance
    dataset_rb = rb.load("text_classification_ds", as_pandas=False)
    
    # 👧🏻 exports a Rubrix Dataset to a hf Dataset
    dataset_ds = dataset_rb.to_datasets()
    
    # 👧🏻 🚀 Leire shares the labelled dataset with the world 
    dataset_ds.push_to_hub("text_classification_ds")
    
    # 👨 John downloads the dataset from the Hugging Face Hub
    dataset_ds = load_dataset("leire/text_classification_ds", split="train")
    
    # 👨 reads in dataset
    dataset_rb = rb.read_datasets(dataset_ds, task="TextClassification")
    
    # 👨 🏷️ logs the dataset and continues labeling with his own Rubrix instance
    rb.log(dataset_rb, "john_text_classification_ds")
    

    You can read more at https://rubrix.readthedocs.io/en/stable/guides/datasets.html

    For each record type, there’s a corresponding Dataset class called DatasetFor<RecordType>. You can look up their API in the reference section.

    Improving NER UI and UX

    The UI for Token Classification has been completely redesigned to provide a better user experience for exploration and annotation. This is the first of a set of changes focusing on annotation productivity for token classification.


    Features

    Bug Fixes

    • #1140: fix/make client models more consistent (#1147) (926bb16), closes #1140
    • client: parse unauthorized api error properly (#1164) (1a5a08d)
    • search: prevent metrics computation breaks searches (#1175) (9f2adc9)
  • v0.10.0(Feb 20, 2022)

    0.10.0 (2022-02-12)

    Now you can use filters in the Define Rules mode (weak labeling). These filters are useful for seeing the impact of rules on specific dataset subpopulations/subsets (e.g., with certain metadata fields, annotated records, etc.):

    Screenshot 2022-02-14 at 11 56 27

    Features

    Bug Fixes

    • #1054: reduce collapsable area. Optimize for annotation (#1106) (48024ba), closes #1054
    • #1054: remove old scroll padlock button (a1d6444), closes #1054
    • #1094: remove computed record fields returned in API results (#1095) (cd61d1e), closes #1094
    • #831: Remove sort field when only one is applied (#1116) (36b276b), closes #831
    • convert pd.NaT to None for event_timestamp (#1105) (21e78e4)
  • v0.9.0(Feb 4, 2022)

    🎉 0.9.0 (2022-02-02)

    • Improve logging
    • Small improvements to the labelling module and weak labeling mode
    • Better setup documentation (python -m rubrix)

    Features

    • #932: label models now modify the prediction_agent when calling LabelModel.predict (#1049) (4a024ee), closes #932
    • #953: add additional metrics to LabelModel.score method (#979) (2887907), closes #953
    • #955: add default for rules in WeakLabels (#976) (34389d3), closes #955 #1011

    Bug Fixes

  • v0.8.2(Jan 31, 2022)

    0.8.2 (2022-01-31)

    Features

    • #1036: remove prediction ok/ko in labelling rules (#1037) (672b852), closes #1036
    • #735: add warning when agent but no prediction/annotation is provided (#987) (ba88c34), closes #735

    Bug Fixes

  • v0.8.1(Jan 20, 2022)

    0.8.1 (2022-01-20)

    Bug Fixes

    • #1002: Show 0 records overall metrics when no rules defined (#1013) (a8a5c79), closes #1002
    • Breadcrumbs: copy workspace from the breadcrumbs when dataset loading has errors #1003 (33e372d), closes #844
    • statics: handle 404 errors for static files (#1006) (f4b656a)
    • #800: compute common aggregations one by one (#990) (8cf420a), closes #800
    • #800: limit number of metadata fields (#993) (bb6b76b), closes #800
    • #905: copy dataset with rules (#948) (8597b83), closes #905
    • #974: display the dropdown in the last record of the scroll (#986) (e5f8d53), closes #974
    • #977: Remove redirection when accessing login (#996) (b3fe2cb), closes #977
  • v0.8.1-alpha.3(Jan 20, 2022)

  • v0.8.1-alpha.2(Jan 20, 2022)

  • v0.8.1-alpha.1(Jan 19, 2022)

  • v0.8.1-alpha.0(Jan 19, 2022)

  • v0.8.0(Jan 12, 2022)

    Introducing interactive Weak labeling (Define rules mode) 🚀

    We are glad to introduce the most important feature to date: now it's possible to iterate on labeling queries directly in the UI with initial support for multi-class text classification. Multilabel and token classification support is coming soon.

    See the video for the recommended workflow:

    https://user-images.githubusercontent.com/1107111/149346471-93cbd7ee-96a2-451a-8f5e-f9e26b246407.mp4

    Check the updated tutorial: https://rubrix.readthedocs.io/en/master/tutorials/weak-supervision-with-rubrix.html

    What's changed

    • [WeakSupervision] Change load_rules import path in guide and tutorial (#939)
    • fix links to new web app reference (#936)
    • Bugfixes/avoid infinite loop when dataset loading (#934)
    • show nan instead of 0 for precision in summary (#930)
    • fix(api): include_metrics param only for search endpoints (#929)
    • [Documentation] Update title page video for docs (#928)
    • update skweak tutorial (#922)
    • [Documentation] Updating the web app docu (#827)
    • publish python package to test.pypi for master and releases branches (#927)
    • [WeakLabels] Align WeakLabels.summary() with web app (#925)
    • UI: show rules without precision properly (#919)
    • chore(build): build docker images for release branches (#921)
    • Docs: Updates readme front video (#923)
    • Docs: Updates weak supervision resources (#920)
    • feat(rules): compute total & ann. coverage before label selection (#916)
    • fix(rules): compute annotated coverage when no label properly (#915)
    • Tutorial: Human-in-the-loop weak supervision with skweak (#869)
    • UI: include affected #records to overall coverage/ann. coverage metrics (#914)
    • fix lint build (#913)
    • UI: manage precision and rules without annotation coverage (#909)
    • fix(#876): process 400 response detail properly (#889)
    • feat(rules): allow compute partial query rule metrics (#907)
    • fix(security): providing default workspace should pass check (#911)
    • UI: reset filters from define rules view (#908)
    • UI: Show number of created rules in rules management view (#910)
    • UI: drop access to rule name field (#904)
    • fix(rules): prevent lost rules with dataset updates (#892)
    • fix(datasets): process owner as part of dataset id (#870)
    • (UI) Rules summary metrics format (#888)
    • UI: Improve code snippet for empty workspace (#886)
    • fix(UI): Remove case sensitive when filtering labels (#882)
    • Docs: Updates Flair zeroshot tutorial (#887)
    • removing wrong video (#885)
    • Update readme (#883)
    • fix(UI) Metrics value by default if no metric (#875)
    • feat(metrics): add token level metrics for token classification from client (#849)
    • UI: New rule metrics layout (#861)
    • chore: expose load_rules from base module (#866)
    • Docs: Regenerates graphs metrics guide (#865)
    • updating loss video (#864)
    • Docs: Update weak supervision guide (#863)
    • Update README.md (#862)
    • Fix: Link loss tutorial (#859)
    • Docs: Improve loss tutorial (#858)
    • Docs: Improve AL and ws tutorials (#857)
    • chore(ci): Include component testing configuration (#839)
    • fix/loss video updated (#853)
    • Docs: Weak supervision guide update (#855)
    • chore(app): upgrade lint dependencies (#841)
    • feat: weak supervision mode (#814)
    • Docs: Review hf tutorial (#852)
    • fix: error link to workspace home (#845)
    • fix(metrics): compute token length for each token (#850)
    • add streaming (#851)
    • fix(rules): prevent division by 0 for overall metrics (#848)
    • small change
    • [Tutorials] Update media structure, remove TLDR heading (#847)
    • Updating videos and images for sentiment classification tutorial (#846)
    • fix(rules): prevent division by zero (#843)
    • new folder and videos for model loss tutorial (#805)
    • feat(token class): add metrics at token level (#838)
    • new folder and images for active learning tutorial (#796)
    • [Tutorials] Typo fix in find label errors tutorial (#842)
    • [Tutorials] Add the new find_label_errors tutorial (#833)
    • [Rule] Modify the client API to the server's weak supervision feature (#840)
    • [LabelModel] Improve Snorkel to not modify the passed in WeakLabels object (#836)
    • feat (search): allow to filtering record metrics fields in search (#837)
    • fix(ui): remove workspace home from code snippet api url (#834)
    • ui: Hide validate button for binary cases in Text classifier (#830)
    • fix print message (#829)
    • feat: Include workspace in url path (#820)
    • fix(ui): align records and global action layouts #825
    • fix(ui): Show labels as selected after validate (#826)
    • feat(labeling rule): implements api endpoint to fetch a single rule (#817)
    • [LabelErrors] Add find_label_errors method (#775)
    • fix(ui): Fix styles in Safari (#815)
    • docs: Add contributors to readme (#822)
    • add missing rubrix import (#819)
    • new folder and images for spacy tutorial (#794)
    • feat(labeling rules): allow edition for rule label and description (#813)
    • refactor(labeling rules): optional label for rule metrics (#811)
    • Fix token alignment on CreationTokenClassificationRecord (#812)
    • feat(server): add overall dataset labeling rules metrics (#807)
    • feat(labeling rules): add coverage for annotated records (#806)
    • fix(ui): Unique ID for scroll state to avoid same state for different dataset records (#809)
    • new folder and images for zeroshot ner tutorial (#804)
    • new folder and images for zeroshot data annotation tutorial (#803)
    • fix(log): check multi-label integrity without search aggregations (#802)
    • updated images, added folder for fastapi tutorial (#801)
    • added folder for weak supervision tutorial (#795)
    • feat(weak supervision): client labeling rules from server (#799)
    • feat(server): labeling rule metrics (#790)
    • fix/edit zero-shot tutorial (#774)
    • fix/edited fastapi tutorial (#773)
    • Fix/edit ner flair tutorial (#766)
    • Fix/edit weaksupervision tutorial (#759)
    • fix(ui): Little changes in fonts (#793)
    • fix(ui): Allow open dataset in new tab from datasets list (#792)
    • feat(server): rubrix namespaces for elasticsearch indices (#789)
    • fix(ui): Show annotation after global validation (#786)
    • remove reload arg launching server using python (#787)
    • updated readme with conda install instruction (#788)
    • fix(ui): Hide scroller component when loading or paginate (#784)
    • fix(ui): allow remove metadata filter from record metadata modal (#772)
    • fix(ui): Token Classifier: validate record without annotation or prediction (#782)
    • Fix/edit active learning tutorial (#760)
    • Docs:minor changes to loss tutorial (#778)
    • Fix/edit model loss tutorial (#767)
    • fix(server): missing deprecated dep (#777)
    • fix(ui): Global validate for records without annotation or prediction (#746)
    • Fix/edit spacy tutorial (#758)
    • Fix/edit labeling tutorial (#750)
    • fix(server) - misaligned entity mentions on CreationTokenClassificationRecord (#771)
    • [Requirements] Require python>=3.7 (#770)
    • [Labeling] Add FlyingSquid label model (#755)
    • Update README.md (#769)
    • Adds Flair example to guide (#762)
    • docs: Updates huggingface examples and adds monitor for Flair (#761)
    • feat(search): show boolean values in metadata (#753)
    • feat(server): allow handle labeling rules for datasets from API (#744)
    • fix(imports): import monitoring with spacy<3.0 fails (#754)
    • [UI] new fonts families (#751)
    • fix(scroll): using new scroll component (#710)
    • fix(ui): filter "validatable" records for global action validate button (#741)
    • feat(monitor): flair ner auto-monitor (#738)

    New Contributors

    • @sugatoray made their first contribution
    • @ruanchaves made their first contribution
  • v0.8.0-alpha.1(Jan 11, 2022)

    • Bugfixes/avoid infinite loop when dataset loading (#934)
    • show nan instead of 0 for precision in summary (#930)
    • fix(api): include_metrics param only for search endpoints (#929)
    • [Documentation] Update title page video for docs (#928)
    • update skweak tutorial (#922)
    • [Documentation] Updating the web app docu (#827)
    • revert test.pypi publish
    • publish python package to test.pypi for master and releases branches (#927)
    • [WeakLabels] Align WeakLabels.summary() with web app (#925)
    • UI: show rules without precision properly (#919)
    • chore(build): build docker images for release branches (#921)
    • Docs: Updates readme front video (#923)
    • Docs: Updates weak supervision resources (#920)
    • feat(rules): compute total & ann. coverage before label selection (#916)
    • fix(rules): compute annotated coverage when no label properly (#915)
    • Tutorial: Human-in-the-loop weak supervision with skweak (#869)
    • UI: include affected #records to overall coverage/ann. coverage metrics (#914)
    • fix lint build (#913)
    • UI: manage precision and rules without annotation coverage (#909)
    • fix(#876): process 400 response detail properly (#889)
    • feat(rules): allow compute partial query rule metrics (#907)
    • fix(security): providing default workspace should pass check (#911)
    • UI: reset filters from define rules view (#908)
    • UI: Show number of created rules in rules management view (#910)
    • UI: drop access to rule name field (#904)
    • fix(rules): prevent lost rules with dataset updates (#892)
    • fix(datasets): process owner as part of dataset id (#870)
    • (UI) Rules summary metrics format (#888)
    • UI: Improve code snippet for empty workspace (#886)
    • fix(UI): Remove case sensitive when filtering labels (#882)
    • Docs: Updates Flair zeroshot tutorial (#887)
    • removing wrong video (#885)
    • Update readme (#883)
    • fix(UI) Metrics value by default if no metric (#875)
    • feat(metrics): add token level metrics for token classification from client (#849)
    • UI: New rule metrics layout (#861)
    • chore: expose load_rules from base module (#866)
    • Docs: Regenerates graphs metrics guide (#865)
    • updating loss video (#864)
    • Docs: Update weak supervision guide (#863)
    • Update README.md (#862)
    • Fix: Link loss tutorial (#859)
    • Docs: Improve loss tutorial (#858)
    • Docs: Improve AL and ws tutorials (#857)
    • chore(ci): Include component testing configuration (#839)
    • fix/loss video updated (#853)
    • Docs: Weak supervision guide update (#855)
    • chore(app): upgrade lint dependencies (#841)
    • feat: weak supervision mode (#814)
    • Docs: Review hf tutorial (#852)
    • fix: error link to workspace home (#845)
    • fix(metrics): compute token length for each token (#850)
    • chore: improve dockerignore files
    • add streaming (#851)
    • fix(rules): prevent division by 0 for overall metrics (#848)
    • small change
    • [Tutorials] Update media structure, remove TLDR heading (#847)
    • Updating videos and images for sentiment classification tutorial (#846)
    • fix(rules): prevent division by zero (#843)
    • new folder and videos for model loss tutorial (#805)
    • feat(token class): add metrics at token level (#838)
    • new folder and images for active learning tutorial (#796)
    • [Tutorials] Typo fix in find label errors tutorial (#842)
    • [Tutorials] Add the new find_label_errors tutorial (#833)
    • [Rule] Modify the client API to the server's weak supervision feature (#840)
    • [LabelModel] Improve Snorkel to not modify the passed in WeakLabels object (#836)
    • feat (search): allow to filtering record metrics fields in search (#837)
    • fix(ui): remove workspace home from code snippet api url (#834)
    • ui: Hide validate button for binary cases in Text classifier (#830)
    • fix print message (#829)
    • feat: Include workspace in url path (#820)
    • fix(ui): align records and global action layouts (#825)
    • fix(ui): Show labels as selected after validate (#826)
    • feat(labeling rule): implements api endpoint to fetch a single rule (#817)
    • [LabelErrors] Add find_label_errors method (#775)
    • fix(ui): Fix styles in Safari (#815)
    • docs: Add contributors to readme (#822)
    • add missing rubrix import (#819)
    • new folder and images for spacy tutorial (#794)
    • feat(labeling rules): allow edition for rule label and description (#813)
    • refactor(labeling rules): optional label for rule metrics (#811)
    • Fix token alignment on CreationTokenClassificationRecord (#812)
    • feat(server): add overall dataset labeling rules metrics (#807)
    • feat(labeling rules): add coverage for annotated records (#806)
    • fix(ui): Unique ID for scroll state to avoid same state for different dataset records (#809)
    • new folder and images for zeroshot ner tutorial (#804)
    • new folder and images for zeroshot data annotation tutorial (#803)
    • fix(log): check multi-label integrity without search aggregations (#802)
    • updated images, added folder for fastapi tutorial (#801)
    • added folder for weak supervision tutorial (#795)
    • feat(weak supervision): client labeling rules from server (#799)
    • feat(server): labeling rule metrics (#790)
    • fix/edit zero-shot tutorial (#774)
    • fix/edited fastapi tutorial (#773)
    • Fix/edit ner flair tutorial (#766)
    • Fix/edit weaksupervision tutorial (#759)
    • fix(ui): Little changes in fonts (#793)
    • fix(ui): Allow open dataset in new tab from datasets list (#792)
    • feat(server): rubrix namespaces for elasticsearch indices (#789)
    • fix(ui): Show annotation after global validation (#786)
    • remove reload arg launching server using python (#787)
    • updated readme with conda install instruction (#788)
    • fix(ui): Hide scroller component when loading or paginate (#784)
    • fix(ui): allow remove metadata filter from record metadata modal (#772)
    • fix(ui): Token Classifier: validate record without annotation or prediction (#782)
    • Fix/edit active learning tutorial (#760)
    • Docs:minor changes to loss tutorial (#778)
    • Fix/edit model loss tutorial (#767)
    • fix(server): missing deprecated dep (#777)
    • fix(ui): Global validate for records without annotation or prediction (#746)
    • Fix/edit spacy tutorial (#758)
    • Fix/edit labeling tutorial (#750)
    • fix(server) - misaligned entity mentions on CreationTokenClassificationRecord (#771)
    • [Requirements] Require python>=3.7 (#770)
    • [Labeling] Add FlyingSquid label model (#755)
    • Update README.md (#769)
    • Adds Flair example to guide (#762)
    • docs: Updates huggingface examples and adds monitor for Flair (#761)
    • feat(search): show boolean values in metadata (#753)
    • feat(server): allow handle labeling rules for datasets from API (#744)
    • fix(imports): import monitoring with spacy<3.0 fails (#754)
    • [UI] new fonts families (#751)
    • fix(scroll): using new scroll component (#710)
    • fix(ui): filter "validatable" records for global action validate button (#741)
    • feat(monitor): flair ner auto-monitor (#738)

    Full Changelog: https://github.com/recognai/rubrix/compare/v0.7.0...v0.8.0-alpha.0

Owner
Recognai
A software company building Natural Language Processing and Machine Learning tools