A library that integrates huggingface transformers with the world of fastai, giving fastai devs everything they need to train, evaluate, and deploy transformer specific models.

blurr

A library that integrates huggingface transformers with version 2 of the fastai framework

Install

You can now install blurr with pip: pip install ohmeow-blurr

Or, even better as this library is under very active development, create an editable install like this:

git clone https://github.com/ohmeow/blurr.git
cd blurr
pip install -e ".[dev]"

How to use

The initial release includes everything you need for sequence classification and question answering tasks. Support for token classification and summarization is incoming. Please check the documentation for more thorough examples of how to use this package.

The following two packages need to be installed for blurr to work:

  1. fastai2 (see http://docs.fast.ai/ for installation instructions)
  2. huggingface transformers (see https://huggingface.co/transformers/installation.html for details)
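
To confirm that both prerequisites (and blurr itself) are visible to your interpreter, a quick check along these lines can help. This is a minimal sketch using only the standard library. The helper name installed_version is made up for illustration, and the distribution names "fastai" and "transformers" are assumed to be the usual PyPI names ("ohmeow-blurr" comes from the install command above).

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumptions: "ohmeow-blurr" is taken from the install
# command; "fastai" and "transformers" are the usual PyPI names.
for name in ("ohmeow-blurr", "fastai", "transformers"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Anything reported as "not installed" needs to be installed before the imports below will work.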

Imports

import torch
from transformers import *
from fastai.text.all import *

from blurr.data.all import *
from blurr.modeling.all import *

Get your data

path = untar_data(URLs.IMDB_SAMPLE)

model_path = Path('models')
imdb_df = pd.read_csv(path/'texts.csv')

Get n_labels from data for config later

n_labels = len(imdb_df['label'].unique())

Get your 🤗 objects

model_cls = AutoModelForSequenceClassification

pretrained_model_name = "bert-base-uncased"

config = AutoConfig.from_pretrained(pretrained_model_name)
config.num_labels = n_labels

hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=model_cls, config=config)

Build your Data 🧱 and your DataLoaders

# single input
blocks = (HF_TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock)
dblock = DataBlock(blocks=blocks,  get_x=ColReader('text'), get_y=ColReader('label'), splitter=ColSplitter())

dls = dblock.dataloaders(imdb_df, bs=4)
dls.show_batch(dataloaders=dls, max_n=2)
text target
0 raising victor vargas : a review < br / > < br / > you know, raising victor vargas is like sticking your hands into a big, steaming bowl of oatmeal. it's warm and gooey, but you're not sure if it feels right. try as i might, no matter how warm and gooey raising victor vargas became i was always aware that something didn't quite feel right. victor vargas suffers from a certain overconfidence on the director's part. apparently, the director thought that the ethnic backdrop of a latino family on the lower east side, and an idyllic storyline would make the film critic proof. he was right, but it didn't fool me. raising victor vargas is the story about a seventeen - year old boy called, you guessed it, victor vargas ( victor rasuk ) who lives his teenage years chasing more skirt than the rolling stones could do in all the years they've toured. the movie starts off in ` ugly fat'donna's bedroom where victor is sure to seduce her, but a cry from outside disrupts his plans when his best - friend harold ( kevin rivera ) comes - a - looking for him. caught in the attempt by harold and his sister, victor vargas runs off for damage control. yet even with the embarrassing implication that he's been boffing the homeliest girl in the neighborhood, nothing dissuades young victor from going off on the hunt for more fresh meat. on a hot, new york city day they make way to the local public swimming pool where victor's eyes catch a glimpse of the lovely young nymph judy ( judy marte ), who's not just pretty, but a strong and independent too. the relationship that develops between victor and judy becomes the focus of the film. the story also focuses on victor's family that is comprised of his grandmother or abuelita ( altagracia guzman ), his brother nino ( also played by real life brother to victor, silvestre rasuk ) and his sister vicky ( krystal rodriguez ). the action follows victor between scenes with judy and scenes with his family. 
victor tries to cope with being an oversexed pimp - daddy, his feelings for judy and his grandmother's conservative catholic upbringing. < br / > < br / > the problems that arise from raising victor vargas are a few, but glaring errors. throughout the film you get to know certain characters like vicky, nino, grandma, judy and even negative
1 many neglect that this isn't just a classic due to the fact that it's the first 3d game, or even the first shoot -'em - up. it's also one of the first stealth games, one of the only ( and definitely the first ) truly claustrophobic games, and just a pretty well - rounded gaming experience in general. with graphics that are terribly dated today, the game thrusts you into the role of b. j. ( don't even * think * i'm going to attempt spelling his last name! ), an american p. o. w. caught in an underground bunker. you fight and search your way through tunnels in order to achieve different objectives for the six episodes ( but, let's face it, most of them are just an excuse to hand you a weapon, surround you with nazis and send you out to waste one of the nazi leaders ). the graphics are, as i mentioned before, quite dated and very simple. the least detailed of basically any 3d game released by a professional team of creators. if you can get over that, however ( and some would suggest that this simplicity only adds to the effect the game has on you ), then you've got one heck of a good shooter / sneaking game. the game play consists of searching for keys, health and ammo, blasting enemies ( aforementioned nazis, and a " boss enemy " per chapter ) of varying difficulty ( which, of course, grows as you move further in the game ), unlocking doors and looking for secret rooms. there is a bonus count after each level is beaten... it goes by how fast you were ( basically, if you beat the'par time ', which is the time it took a tester to go through the same level ; this can be quite fun to try and beat, and with how difficult the levels are to find your way in, they are even challenging after many play - throughs ), how much nazi gold ( treasure ) you collected and how many bad guys you killed. basically, if you got 100 % of any of aforementioned, you get a bonus, helping you reach the coveted high score placings. 
the game ( mostly, but not always ) allows for two contrastingly different methods of playing... stealthily or gunning down anything and everything you see. you can either run or walk, and amongst your weapons is also a knife... running is heard instantly the moment you enter the same room as the guard, as positive

... and 🚂

#slow
model = HF_BaseModelWrapper(hf_model)

learn = Learner(dls, 
                model,
                opt_func=partial(Adam, decouple_wd=True),
                loss_func=CrossEntropyLossFlat(),
                metrics=[accuracy],
                cbs=[HF_BaseModelCallback],
                splitter=hf_splitter)

learn.freeze()

learn.fit_one_cycle(3, lr_max=1e-3)
epoch train_loss valid_loss accuracy time
0 0.594905 0.374806 0.850000 00:21
1 0.348940 0.413091 0.830000 00:21
2 0.288840 0.270606 0.905000 00:21
#slow
learn.show_results(learner=learn, max_n=2)
text target prediction
0 the trouble with the book, " memoirs of a geisha " is that it had japanese surfaces but underneath the surfaces it was all an american man's way of thinking. reading the book is like watching a magnificent ballet with great music, sets, and costumes yet performed by barnyard animals dressed in those costumesso far from japanese ways of thinking were the characters. < br / > < br / > the movie isn't about japan or real geisha. it is a story about a few american men's mistaken ideas about japan and geisha filtered through their own ignorance and misconceptions. so what is this movie if it isn't about japan or geisha? is it pure fantasy as so many people have said? yes, but then why make it into an american fantasy? < br / > < br / > there were so many missed opportunities. imagine a culture where there are no puritanical hang - ups, no connotations of sin about sex. sex is natural and normal. how is sex handled in this movie? right. like it was dirty. the closest thing to a sex scene in the movie has sayuri wrinkling up her nose and grimacing with distaste for five seconds as if the man trying to mount her had dropped a handful of cockroaches on her crotch. < br / > < br / > does anyone actually enjoy sex in this movie? nope. one character is said to be promiscuous but all we see is her pushing away her lover because it looks like she doesn't want to get caught doing something dirty. such typical american puritanism has no place in a movie about japanese geisha. < br / > < br / > did sayuri enjoy her first ravishing by some old codger after her cherry was auctioned off? nope. she lies there like a cold slab of meat on a chopping block. of course she isn't supposed to enjoy it. and that is what i mean about this movie. why couldn't they have given her something to enjoy? why does all the sex have to be sinful and wrong? < br / > < br / > behind mameha the chairman was sayuri's secret patron, and as such he was behind the auction of her virginity. 
he could have rigged the auction and won her himself. nobu didn't even bid. so why did the chairman let that old codger win her and, reeking of old - man stink, negative negative
1 < br / > < br / > i'm sure things didn't exactly go the same way in the real life of homer hickam as they did in the film adaptation of his book, rocket boys, but the movie " october sky " ( an anagram of the book's title ) is good enough to stand alone. i have not read hickam's memoirs, but i am still able to enjoy and understand their film adaptation. the film, directed by joe johnston and written by lewis colick, records the story of teenager homer hickam ( jake gyllenhaal ), beginning in october of 1957. it opens with the sound of a radio broadcast, bringing news of the russian satellite sputnik, the first artificial satellite in orbit. we see a images of a blue - gray town and its people : mostly miners working for the olga coal company. one of the miners listens to the news on a hand - held radio as he enters the elevator shaft, but the signal is lost as he disappears into the darkness, losing sight of the starry sky above him. a melancholy violin tune fades with this image. we then get a jolt of elvis on a car radio as words on the screen inform us of the setting : october 5, 1957, coalwood, west virginia. homer and his buddies, roy lee cook ( william lee scott ) and sherman o'dell ( chad lindberg ), are talking about football tryouts. football scholarships are the only way out of the town, and working in the mines, for these boys. " why are the jocks the only ones who get to go to college, " questions homer. roy lee replies, " they're also the only ones who get the girls. " homer doesn't make it in football like his older brother, so he is destined for the mines, and to follow in his father's footsteps as mine foreman. until he sees the dot of light streaking across the october sky. then he wants to build a rocket. " i want to go into space, " says homer. after a disastrous attempt involving a primitive rocket and his mother's ( natalie canerday ) fence, homer enlists the help of the nerdy quentin wilson ( chris owen ). 
quentin asks homer, " what do you want to know about rockets? " homer quickly anwers, " everything. " his science teacher at big creek high school, miss frieda riley ( laura dern ) greatly supports homer, and positive positive

Using the high-level Blurr API

Using the high-level API we can reduce DataBlock, DataLoaders, and Learner creation into a single line of code.

Included in the high-level API is a general Blearner class (pronounced "Blurrner") that you can use with hand-crafted DataLoaders, as well as task-specific Blearners like BlearnerForSequenceClassification that will handle everything given your raw data sourced from a pandas DataFrame, CSV file, or list of dictionaries (for example, a huggingface datasets dataset).

#slow
learn = BlearnerForSequenceClassification.from_dataframe(imdb_df, pretrained_model_name, dl_kwargs={ 'bs': 4})
#slow
learn.fit_one_cycle(1, lr_max=1e-3)
epoch train_loss valid_loss f1_score accuracy time
0 0.532659 0.433739 0.819672 0.835000 00:21
#slow
learn.show_results(learner=learn, max_n=2)
text target prediction
0 the trouble with the book, " memoirs of a geisha " is that it had japanese surfaces but underneath the surfaces it was all an american man's way of thinking. reading the book is like watching a magnificent ballet with great music, sets, and costumes yet performed by barnyard animals dressed in those costumesso far from japanese ways of thinking were the characters. < br / > < br / > the movie isn't about japan or real geisha. it is a story about a few american men's mistaken ideas about japan and geisha filtered through their own ignorance and misconceptions. so what is this movie if it isn't about japan or geisha? is it pure fantasy as so many people have said? yes, but then why make it into an american fantasy? < br / > < br / > there were so many missed opportunities. imagine a culture where there are no puritanical hang - ups, no connotations of sin about sex. sex is natural and normal. how is sex handled in this movie? right. like it was dirty. the closest thing to a sex scene in the movie has sayuri wrinkling up her nose and grimacing with distaste for five seconds as if the man trying to mount her had dropped a handful of cockroaches on her crotch. < br / > < br / > does anyone actually enjoy sex in this movie? nope. one character is said to be promiscuous but all we see is her pushing away her lover because it looks like she doesn't want to get caught doing something dirty. such typical american puritanism has no place in a movie about japanese geisha. < br / > < br / > did sayuri enjoy her first ravishing by some old codger after her cherry was auctioned off? nope. she lies there like a cold slab of meat on a chopping block. of course she isn't supposed to enjoy it. and that is what i mean about this movie. why couldn't they have given her something to enjoy? why does all the sex have to be sinful and wrong? < br / > < br / > behind mameha the chairman was sayuri's secret patron, and as such he was behind the auction of her virginity. 
he could have rigged the auction and won her himself. nobu didn't even bid. so why did the chairman let that old codger win her and, reeking of old - man stink, negative negative
1 < br / > < br / > i'm sure things didn't exactly go the same way in the real life of homer hickam as they did in the film adaptation of his book, rocket boys, but the movie " october sky " ( an anagram of the book's title ) is good enough to stand alone. i have not read hickam's memoirs, but i am still able to enjoy and understand their film adaptation. the film, directed by joe johnston and written by lewis colick, records the story of teenager homer hickam ( jake gyllenhaal ), beginning in october of 1957. it opens with the sound of a radio broadcast, bringing news of the russian satellite sputnik, the first artificial satellite in orbit. we see a images of a blue - gray town and its people : mostly miners working for the olga coal company. one of the miners listens to the news on a hand - held radio as he enters the elevator shaft, but the signal is lost as he disappears into the darkness, losing sight of the starry sky above him. a melancholy violin tune fades with this image. we then get a jolt of elvis on a car radio as words on the screen inform us of the setting : october 5, 1957, coalwood, west virginia. homer and his buddies, roy lee cook ( william lee scott ) and sherman o'dell ( chad lindberg ), are talking about football tryouts. football scholarships are the only way out of the town, and working in the mines, for these boys. " why are the jocks the only ones who get to go to college, " questions homer. roy lee replies, " they're also the only ones who get the girls. " homer doesn't make it in football like his older brother, so he is destined for the mines, and to follow in his father's footsteps as mine foreman. until he sees the dot of light streaking across the october sky. then he wants to build a rocket. " i want to go into space, " says homer. after a disastrous attempt involving a primitive rocket and his mother's ( natalie canerday ) fence, homer enlists the help of the nerdy quentin wilson ( chris owen ). 
quentin asks homer, " what do you want to know about rockets? " homer quickly anwers, " everything. " his science teacher at big creek high school, miss frieda riley ( laura dern ) greatly supports homer, and positive positive

❗ Updates

09/06/2021 - v0.1.0

  • Complete overhaul of documentation for entire library (using nbverbose)
  • Updated all the nbdev bits and users now have the ability to open any doc in colab (H/T Zach Mueller)
  • Added calc_every argument to the HF_Seq2SeqMetricsCallback so that you can speed up training by NOT calculating the seq2seq metrics on every epoch (this can be time consuming).
  • Misc. bug fixes and addition of other helper methods throughout the library
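
The idea behind calc_every can be sketched in plain Python. This is not blurr's implementation, and the values calc_every actually accepts are not stated here. The helper epochs_with_metrics and its integer `every` parameter are hypothetical, chosen only to show how skipping an expensive metric on some epochs might be scheduled.

```python
def epochs_with_metrics(n_epochs, every):
    """Return the 0-based epochs on which an expensive metric would be computed.

    Hypothetical helper: computes on every `every`-th epoch and always on the
    final epoch, so the last validation pass still reports a score.
    """
    if every < 1:
        raise ValueError("every must be >= 1")
    return [e for e in range(n_epochs) if (e + 1) % every == 0 or e == n_epochs - 1]


# With 5 epochs and every=2, metrics run on epochs 1 and 3, plus the final epoch 4.
print(epochs_with_metrics(5, 2))
```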

08/24/2021 - v0.0.33

  • Complete overhaul of documentation for sequence classification bits
  • Finished low-level API to support Blurr training with PyTorch and/or fast.ai Datasets/DataLoaders
  • Misc. bug fixes

07/11/2021 - v0.0.30

  • Finished initial Blearner high-level API for all Blurr supported tasks
  • Finished high-level APIs examples for all Blurr supported tasks
  • Fixed squad preprocessing

07/01/2021 - v0.0.29

  • Updated to work with transformers 4.8
  • Introducing the Blearner high-level API with task specific blearners for building your DataBlock, DataLoaders, and Learner in one line of code (usually :))
  • Added LOTS of examples (using low/high-level APIs, using Hugging Face datasets, and handling all the GLUE tasks)
  • Updated setup.py so you can now use Blurr on Windows (H/T to @EinAeffchen for the fix)

06/16/2021

  • Updated to work with fastai 2.4
  • Removed blurr_summary as Learner.summary works with fastai 2.4
  • Updated Learner.lr_find code in docs to the updated API in fastai 2.4

06/10/2021

  • Updated to work with Huggingface 4.6.x
  • Added MLM fine-tuning
  • Reorganized code/docs

05/04/2021

The "May the Fourth be with you" release:

  • Updated to work with Huggingface 4.5.x and Fastai 2.3.1 (there is a bug in 2.3.0 that breaks blurr so make sure you are using the latest)

  • Fixed Github issues #36, #34

  • Misc. improvements to get blurr in line with the upcoming Huggingface 5.0 release

  • A few breaking changes:

  1. BLURR_MODEL_HELPER is now just BLURR

  2. Task specific auto models need to be built using the new Huggingface AutoModelFor<Insert task here> classes. The docs have been updated to show you how it works; the prior way of building such models no longer works.

12/31/2020

The "Goodbye 2020" release with lots of goodies for blurr users:

  • Updated the Seq2Seq models to use some of the latest huggingface bits like tokenizer.prepare_seq2seq_batch.
  • Separated out the Seq2Seq and Token Classification metrics into metrics-specific callbacks for a better separation of concerns. As a best practice, you should now only use them as fit_one_cycle, etc. callbacks rather than attach them to your Learner.
  • NEW: Translation is now available in blurr, joining causal language modeling and summarization in our core Seq2Seq stack
  • NEW: Integration of huggingface's Seq2Seq metrics (rouge, bertscore, meteor, bleu, and sacrebleu). Plenty of info on how to set this up in the docs.
  • NEW: Added default_text_gen_kwargs, a method that given a huggingface config, model, and task (optional), will return the default/recommended kwargs for any text generation models.
  • A lot of code cleanup (e.g., refactored naming and removal of redundant code into classes/methods)
  • More model support and more tests across the board! Check out the docs for more info
  • Misc. validation improvements and bug fixes.

As I'm sure there is plenty I can do to make this library better, please don't hesitate to join in and help the effort by submitting PRs, pointing out problems with my code, or letting me know what and how I can improve things generally. Some models, like mbart and mt5 for example, aren't giving good results and I'd love to get any and all feedback from the community on how to resolve such issues ... so hit me up, I promise I won't bite :)

12/20/2020

  • Updated Learner.blurr_predict and Learner.blurr_predict_tokens to support single or multiple items
  • Added ONNX support for sequence classification, token classification, and question/answering tasks. blurrONNX provides ONNX friendly variants of Learner.blurr_predict and Learner.blurr_predict_tokens in the form of blurrONNX.predict and blurrONNX.predict_tokens respectively. Like their Learner equivalents, these methods support single or multiple items for inference. See the docs/code for examples and speed gains you get with ONNX.
  • Added quantization support when converting your blurr models to ONNX.
  • Requires fast.ai >= 2.1.5 and huggingface transformers >= 4.x

12/12/2020

  • Updated to work with the latest version of fast.ai (2.1.8) and huggingface transformers >= 4.x
  • Fixed Learner.blurr_summary to work with fast.ai >= 2.1.8
  • Fixed inclusion of add_prefix_space in tokenizer BLURR_MODEL_HELPER
  • Fixed token classification show_results for tokenizers that add a prefix space
  • Notebooks run with environment variable "TOKENIZERS_PARALLELISM=false" to avoid fast tokenizer warnings
  • Updated docs
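
If you want the same behaviour outside the notebooks, one way is to set the variable from Python before the tokenizers are loaded. This is a minimal sketch; exporting the variable in your shell before launching Python should work equally well.

```python
import os

# Set this before importing transformers/tokenizers so the fast tokenizers
# see the value when they initialise.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```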

11/12/2020

  • Updated documentation
  • Updated model callbacks to support mixed precision training regardless of whether you are calculating the loss yourself or letting huggingface do it for you.

11/10/2020

  • Major update just about everywhere to facilitate a breaking change in fastai's treatment of before_batch transforms.
  • Reorganized code as I begin to work on LM and other text2text tasks
  • Misc. fixes

10/08/2020

  • Updated all models to use ModelOutput classes instead of traditional tuples. ModelOutput attributes are assigned to the appropriate fastai bits like Learner.pred and Learner.loss and anything else you've requested the huggingface model to return is available via the Learner.blurr_model_outputs dictionary (see next two bullet items)
  • Added ability to grab attentions and hidden state from Learner. You can get at them via Learner.blurr_model_outputs dictionary if you tell HF_BaseModelWrapper to provide them.
  • Added model_kwargs to HF_BaseModelWrapper should you need to request a huggingface model to return something specific to its type. These outputs will be available via the Learner.blurr_model_outputs dictionary as well.

09/16/2020

  • Major overhaul to do everything at batch time (including tokenization/numericalization). If this backfires, I'll roll everything back but as of now, I think this approach not only meshes better with how huggingface tokenization works and reduces RAM utilization for big datasets, but also opens up opportunities for incorporating augmentation, building adversarial models, etc. ... Thoughts?
  • Added tests for summarization bits
  • New change may require some small modifications (see docs or ask on issues thread if you have problems you can't figure out). I'm NOT doing a release to pypi until folks have a chance to work with the latest.

09/07/2020

  • Added tests for question/answer and summarization transformer models
  • Updated summarization to support BART, T5, and Pegasus

08/20/2020

  • Updated everything to work with the latest version of fastai (tested against 2.0.0)
  • Added batch-time padding, so that by default now, HF_TokenizerTransform doesn't add any padding tokens and all huggingface inputs are padded simply to the max sequence length in each batch rather than to the max length (passed in and/or acceptable to the model). This should create efficiencies across the board, from memory consumption to GPU utilization. The old tried and true method of padding during tokenization requires you to pass in padding='max_length' to HF_TextBlock.
  • Removed code to remove fastai2 @patched summary methods which had previously conflicted with a couple of the huggingface transformers
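
To see why batch-time padding saves memory, here is a small plain-Python sketch of the two strategies. It is an illustration only, not blurr's code: the helper names, the pad id of 0, the token ids, and the 512-token maximum are all assumptions made for the example.

```python
def pad_to_length(seqs, length, pad_id=0):
    """Right-pad (and truncate) every sequence to exactly `length` tokens."""
    return [s[:length] + [pad_id] * (length - len(s[:length])) for s in seqs]


def pad_to_batch_max(seqs, pad_id=0):
    """Right-pad every sequence to the longest sequence in this batch only."""
    return pad_to_length(seqs, max(len(s) for s in seqs), pad_id)


# Two made-up tokenized inputs of length 3 and 5.
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 102]]

fixed = pad_to_length(batch, 512)   # pad to a model-wide maximum (assumed 512)
dynamic = pad_to_batch_max(batch)   # pad to this batch's longest item

print(sum(len(s) for s in fixed))    # 1024 token slots
print(sum(len(s) for s in dynamic))  # 10 token slots
```

The shorter the typical input is relative to the model's maximum, the larger the gap between the two totals.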

08/13/2020

  • Updated everything to work with the latest transformers and fastai
  • Reorganized code to bring it more in line with how huggingface separates out their "tasks".

07/06/2020

  • Updated everything to work with huggingface>=3.02
  • Changed a lot of the internals to make everything more efficient and performant along with the latest version of huggingface ... meaning, I have broken things for folks using previous versions of blurr :).

06/27/2020

  • Simplified the BLURR_MODEL_HELPER.get_hf_objects method to support a wide range of options in terms of building the necessary huggingface objects (architecture, config, tokenizer, and model). Also added cache_dir for saving pre-trained objects in a custom directory.
  • Misc. renaming and cleanup that may break existing code (please see the docs/source if things blow up)
  • Added missing required libraries to requirements.txt (e.g., nlp)

05/23/2020

  • Initial support for text generation (e.g., summarization, conversational agents) models now included. Only tested with BART so if you try it with other models before I do, lmk what works ... and what doesn't

05/17/2020

  • Major code restructuring to make it easier to build out the library.
  • HF_TokenizerTransform replaces HF_Tokenizer, handling the tokenization and numericalization in one place. DataBlock code has been dramatically simplified.
  • Tokenization correctly handles huggingface tokenizers that require add_prefix_space=True.
  • HF_BaseModelWrapper and HF_BaseModelCallback are required and work together in order to allow developers to tie into any callback friendly event exposed by fastai2 and also pass in named arguments to the huggingface models.
  • show_batch and show_results have been updated for Question/Answer and Token Classification models to represent the data and results in a more easily interpretable manner than the defaults.

05/06/2020

  • Initial support for Token classification (e.g., NER) models now included
  • Extended fastai's Learner object with a predict_tokens method used specifically in token classification
  • HF_BaseModelCallback can be used (or extended) instead of the model wrapper to ensure your inputs into the huggingface model are correct (recommended). See docs for examples (and thanks to fastai's Sylvain for the suggestion!)
  • HF_Tokenizer can work with strings or a string representation of a list (the latter helpful for token classification tasks)
  • show_batch and show_results methods have been updated to allow better control on how huggingface tokenized data is represented in those methods
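
Accepting a string representation of a list matters because list-valued columns read back from a CSV file arrive as strings. The sketch below shows one way such input could be normalized. It is not blurr's implementation; as_token_list is a hypothetical helper written for this example.

```python
import ast


def as_token_list(value):
    """Return `value` as a list of tokens.

    Hypothetical helper: lists pass through, a string that looks like a
    Python list literal is parsed, and any other string is split on whitespace.
    """
    if isinstance(value, list):
        return value
    text = value.strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError):
            parsed = None  # not a valid literal, fall back to whitespace split
        if isinstance(parsed, list):
            return [str(tok) for tok in parsed]
    return text.split()


print(as_token_list("['My', 'name', 'is', 'Wayde']"))
print(as_token_list("My name is Wayde"))
```

Both calls yield the same four tokens, so downstream token classification code only ever sees a list.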

⭐ Props

A word of gratitude to the following individuals, repos, and articles, which inspired much of this work:

Comments
  • Value Error when importing blurr.modeling.all and blurr.data.all

    After pip install in Colab, import blurr.modeling.all and blurr.data.all are throwing errors. Did not encounter errors last week, not sure when they began. Has there been a syntax change for setup?


    ValueError                                Traceback (most recent call last)
    /usr/local/lib/python3.6/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
        354                 try:
    --> 355                     return self._range.index(new_key)
        356                 except ValueError as err:

    ValueError: 1 is not in range

    The above exception was the direct cause of the following exception:

    KeyError                                  Traceback (most recent call last)
    6 frames
    in <module>()
    ----> 1 from blurr.modeling.all import *

    /usr/local/lib/python3.6/dist-packages/blurr/modeling/all.py in <module>()
    ----> 1 from ..utils import *
          2 from .core import *
          3 from .question_answering import *
          4 from .token_classification import *
          5 from .text2text.core import *

    /usr/local/lib/python3.6/dist-packages/blurr/utils.py in <module>()
        182
        183 # Cell
    --> 184 BLURR_MODEL_HELPER = ModelHelper()
        185
        186 # Cell

    /usr/local/lib/python3.6/dist-packages/blurr/utils.py in __call__(self, *args, **kwargs)
         25
         26     def __call__(self, *args, **kwargs):
    ---> 27         if self._instance == None: self._instance = self._cls(*args, **kwargs)
         28         return self._instance
         29

    /usr/local/lib/python3.6/dist-packages/blurr/utils.py in __init__(self)
         59         model_type_df = self._df[(self._df.functional_area == 'modeling')].class_name.str.split('For', n=1, expand=True)
         60
    ---> 61         model_type_df[1] = np.where(model_type_df[1].notnull(),
         62                                     'For' + model_type_df[1].astype(str),
         63                                     model_type_df[1])

    /usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in __getitem__(self, key)
       2904             if self.columns.nlevels > 1:
       2905                 return self._getitem_multilevel(key)
    -> 2906             indexer = self.columns.get_loc(key)
       2907             if is_integer(indexer):
       2908                 indexer = [indexer]

    /usr/local/lib/python3.6/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
        355                     return self._range.index(new_key)
        356                 except ValueError as err:
    --> 357                     raise KeyError(key) from err
        358             raise KeyError(key)
        359         return super().get_loc(key, method=method, tolerance=tolerance)

    KeyError: 1

    opened by JeremyOBrien16 15
  • Causal Language Modelling from files

    Hi @ohmeow, this is probably a basic question, but I'm having some issues doing causal language modelling from a set of wikitext-100 style files. My data folder is split as follows:

    data/
           - train/
               - file1.txt
               - file2.txt
               - file3.txt
               - ..... 
           - valid/
               - file1.txt
               - file2.txt
               - file3.txt
               - .....
    

    So far I have been following your gpt2 tutorial for LM:

    pretrained_model_name = "gpt2"
    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(pretrained_model_name, model_cls=AutoModelForCausalLM)
    if (hf_tokenizer.pad_token is None): hf_tokenizer.pad_token = '[PAD]'
    before_batch_tfm = HF_LMBeforeBatchTransform(hf_arch, hf_config, hf_tokenizer, hf_model, lm_strategy_cls=CausalLMStrategy)
    blocks = [HF_TextBlock(before_batch_tfm=before_batch_tfm, input_return_type=HF_CausalLMInput), noop]
    path = config["data_path"]
    get_items = partial(get_text_files, folders=[train, valid])
    dblock = DataBlock(blocks=blocks, get_items=get_items, get_y=None, splitter=config["splitter"])
    dls = TextDataLoaders.from_dblock(dblock, path, path=path, seq_len=config["max_seq_len"])
    

    Config is a dict containing various params and a splitter class that's able to get the validation / train sets from the data folder passed in.

    This is, however, throwing the following exception:

    ~/code/tgalery/fastai/fastai/data/core.py in from_dblock(cls, dblock, source, path, bs, val_bs, shuffle, device, **kwargs)
        190     @classmethod
        191     def from_dblock(cls, dblock, source, path='.',  bs=64, val_bs=None, shuffle=True, device=None, **kwargs):
    --> 192         return dblock.dataloaders(source, path=path, bs=bs, val_bs=val_bs, shuffle=shuffle, device=device, **kwargs)
        193 
        194     _docs=dict(__getitem__="Retrieve `DataLoader` at `i` (`0` is training, `1` is validation)",
    
    ~/code/tgalery/fastai/fastai/data/block.py in dataloaders(self, source, path, verbose, **kwargs)
        113         dsets = self.datasets(source, verbose=verbose)
        114         kwargs = {**self.dls_kwargs, **kwargs, 'verbose': verbose}
    --> 115         return dsets.dataloaders(path=path, after_item=self.item_tfms, after_batch=self.batch_tfms, **kwargs)
        116 
        117     _docs = dict(new="Create a new `DataBlock` with other `item_tfms` and `batch_tfms`",
    
    ~/code/tgalery/fastai/fastai/data/core.py in dataloaders(self, bs, shuffle_train, shuffle, val_shuffle, n, path, dl_type, dl_kwargs, device, drop_last, val_bs, **kwargs)
        229         val_kwargs={k[4:]:v for k,v in kwargs.items() if k.startswith('val_')}
        230         def_kwargs = {'bs':bs,'shuffle':shuffle,'drop_last':drop_last,'n':n,'device':device}
    --> 231         dl = dl_type(self.subset(0), **merge(kwargs,def_kwargs, dl_kwargs[0]))
        232         def_kwargs = {'bs':bs if val_bs is None else val_bs,'shuffle':val_shuffle,'n':None,'drop_last':False}
        233         dls = [dl] + [dl.new(self.subset(i), **merge(kwargs,def_kwargs,val_kwargs,dl_kwargs[i]))
    
    ~/code/tgalery/fastai/fastai/text/data.py in __init__(self, dataset, sort_func, res, **kwargs)
        187         self.sort_func = _default_sort if sort_func is None else sort_func
        188         if res is None and self.sort_func == _default_sort: res = _get_lengths(dataset)
    --> 189         self.res = [self.sort_func(self.do_item(i)) for i in range_of(self.dataset)] if res is None else res
        190         if len(self.res) > 0: self.idx_max = np.argmax(self.res)
        191 
    
    ~/code/tgalery/fastai/fastai/text/data.py in <listcomp>(.0)
        187         self.sort_func = _default_sort if sort_func is None else sort_func
        188         if res is None and self.sort_func == _default_sort: res = _get_lengths(dataset)
    --> 189         self.res = [self.sort_func(self.do_item(i)) for i in range_of(self.dataset)] if res is None else res
        190         if len(self.res) > 0: self.idx_max = np.argmax(self.res)
        191 
    
    ~/code/tgalery/fastai/fastai/text/data.py in _default_sort(x)
        178 
        179 # Cell
    --> 180 def _default_sort(x): return len(x[0])
        181 
        182 @delegates(TfmdDL)
    
    TypeError: object of type 'PosixPath' has no len()
    

    Any pointers on what I might be doing wrong?

    opened by tgalery 10
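The last frame of the traceback shows `_default_sort` calling `len(x[0])` on a `PosixPath`, which suggests the items reaching the DataLoader are still file paths rather than text. A minimal sketch of that distinction, using only the standard library. `read_text_item` is a hypothetical helper name, and passing something like it as `get_x` to the `DataBlock` is an assumption rather than a fix confirmed in this thread:

```python
import tempfile
from pathlib import Path


def read_text_item(path):
    """Hypothetical getter: turn a file path into the text it contains."""
    return Path(path).read_text(encoding="utf-8")


# Build a throwaway file so the example runs anywhere.
tmp_dir = Path(tempfile.mkdtemp())
sample = tmp_dir / "file1.txt"
sample.write_text("hello blurr", encoding="utf-8")

# A Path has no length, which is what the traceback reports.
try:
    len(sample)
    path_has_len = True
except TypeError:
    path_has_len = False

# Reading the file first yields a string, which does have a length.
text = read_text_item(sample)
print(path_has_len, len(text))
```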
  • How to make predictions on a batch?

    Hi,

    I'm currently running into errors when trying to generate predictions on a batch using fastai functions such as dls.test_dl() to create the test DataLoader and then learn.get_preds() to get all the predictions. I searched the documentation but couldn't find any examples of batch prediction. Could you show me an example of how to do it?

    opened by ncduy0303 8
  • RuntimeError: CUDA error: device-side assert triggered

    Hi there, I'm getting a CUDA device-side assert on self.loss_grad.backward()

    https://gist.github.com/tyoc213/1527e3e26f0d037466077949becf0063

    It seems that this type of error is normally caused by a size mismatch between tensors, but I couldn't pinpoint more than that for this issue.

    opened by tyoc213 8
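A device-side assert is often triggered by an index that falls outside the expected range, for example a target label greater than or equal to the number of classes the model was configured with. Whether that is the cause here is an assumption, since the linked gist is not reproduced in this thread. A small check, with hypothetical names, that can be run on the raw labels before training:

```python
def out_of_range_labels(labels, n_classes):
    """Return (position, label) pairs whose label is not in [0, n_classes)."""
    return [(i, y) for i, y in enumerate(labels) if not 0 <= y < n_classes]


# Example: a model configured for 3 classes, with two invalid targets.
bad = out_of_range_labels([0, 1, 2, 5, -1], n_classes=3)
print(bad)
```

Running the same batch on the CPU, or setting the environment variable CUDA_LAUNCH_BLOCKING=1, usually gives a more precise error location.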
  • add pegasus support

    Pegasus has the same parameter groups as Bart, so the splitter works for both.

    When Pegasus decodes, it leaves these <n> symbols in, so I added a line to take them out.

    opened by HenryDashwood 8
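The exact line added in the PR is not shown here. As an illustration only, removing the markers from a decoded string could look like the following, where `strip_pegasus_newlines` is a hypothetical helper name:

```python
def strip_pegasus_newlines(text, replacement=" "):
    """Replace the literal "<n>" markers left in decoded Pegasus output."""
    cleaned = text.replace("<n>", replacement)
    # Collapse any doubled whitespace the replacement may have introduced.
    return " ".join(cleaned.split())


print(strip_pegasus_newlines("First sentence.<n>Second sentence."))
```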
  • How to do transfer learning on a pretrained downstream task model?

    Hi @ohmeow ,

    Is it possible to use blurr to achieve this? I'm trying to do transfer learning from a pretrained NER task model with 39 labels (instead of an LM model like in your example) to a smaller NER dataset with only 5 labels. Unfortunately, I'm stuck on how to do so.

    I tried 2 different ways, but both lead to errors:

    1. Fill in config.num_labels
    task = HF_TASKS_AUTO.TokenClassification
    pretrained_model_name = 'cahya/bert-base-indonesian-NER' 
    config = AutoConfig.from_pretrained(pretrained_model_name)
    config.num_labels = len(labels) # 5
    
    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
                                                                                   task=task, 
                                                                                   config=config)
    
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-24-b2c1babcdb91> in <module>
          4 config.num_labels = len(labels) # 5
          5 
    ----> 6 hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, 
          7                                                                                task=task,
          8                                                                                config=config)
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/blurr/utils.py in get_hf_objects(self, pretrained_model_name_or_path, task, config, tokenizer_cls, model_cls, config_kwargs, tokenizer_kwargs, model_kwargs, cache_dir)
        175                 model_cls = self.get_models(arch="auto", task=task.name)[0]
        176 
    --> 177             model = model_cls.from_pretrained(pretrained_model_name_or_path,
        178                                               config=config,
        179                                               cache_dir=cache_dir,
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/transformers/models/auto/modeling_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
       1611 
       1612         if type(config) in MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING.keys():
    -> 1613             return MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING[type(config)].from_pretrained(
       1614                 pretrained_model_name_or_path, *model_args, config=config, **kwargs
       1615             )
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
       1155                 )
       1156             if len(error_msgs) > 0:
    -> 1157                 raise RuntimeError(
       1158                     "Error(s) in loading state_dict for {}:\n\t{}".format(
       1159                         model.__class__.__name__, "\n\t".join(error_msgs)
    
    RuntimeError: Error(s) in loading state_dict for BertForTokenClassification:
    	size mismatch for classifier.weight: copying a param with shape torch.Size([39, 768]) from checkpoint, the shape in current model is torch.Size([5, 768]).
    	size mismatch for classifier.bias: copying a param with shape torch.Size([39]) from checkpoint, the shape in current model is torch.Size([5]).
    
    2. Add an extra Linear(39, 5) layer at the end: model = HF_BaseModelWrapper(nn.Sequential(hf_model, nn.Linear(39, 5))). But when I call learn.fit(), it results in this:
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-16-8587f3539821> in <module>
    ----> 1 learn.fit(1)
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/learner.py in fit(self, n_epoch, lr, wd, cbs, reset_opt)
        209             self.opt.set_hypers(lr=self.lr if lr is None else lr)
        210             self.n_epoch = n_epoch
    --> 211             self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
        212 
        213     def _end_cleanup(self): self.dl,self.xb,self.yb,self.pred,self.loss = None,(None,),(None,),None,None
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
        158 
        159     def _with_events(self, f, event_type, ex, final=noop):
    --> 160         try: self(f'before_{event_type}');  f()
        161         except ex: self(f'after_cancel_{event_type}')
        162         self(f'after_{event_type}');  final()
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/learner.py in _do_fit(self)
        200         for epoch in range(self.n_epoch):
        201             self.epoch=epoch
    --> 202             self._with_events(self._do_epoch, 'epoch', CancelEpochException)
        203 
        204     def fit(self, n_epoch, lr=None, wd=None, cbs=None, reset_opt=False):
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
        158 
        159     def _with_events(self, f, event_type, ex, final=noop):
    --> 160         try: self(f'before_{event_type}');  f()
        161         except ex: self(f'after_cancel_{event_type}')
        162         self(f'after_{event_type}');  final()
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/learner.py in _do_epoch(self)
        194 
        195     def _do_epoch(self):
    --> 196         self._do_epoch_train()
        197         self._do_epoch_validate()
        198 
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/learner.py in _do_epoch_train(self)
        186     def _do_epoch_train(self):
        187         self.dl = self.dls.train
    --> 188         self._with_events(self.all_batches, 'train', CancelTrainException)
        189 
        190     def _do_epoch_validate(self, ds_idx=1, dl=None):
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
        158 
        159     def _with_events(self, f, event_type, ex, final=noop):
    --> 160         try: self(f'before_{event_type}');  f()
        161         except ex: self(f'after_cancel_{event_type}')
        162         self(f'after_{event_type}');  final()
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/learner.py in all_batches(self)
        164     def all_batches(self):
        165         self.n_iter = len(self.dl)
    --> 166         for o in enumerate(self.dl): self.one_batch(*o)
        167 
        168     def _do_one_batch(self):
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/learner.py in one_batch(self, i, b)
        182         self.iter = i
        183         self._split(b)
    --> 184         self._with_events(self._do_one_batch, 'batch', CancelBatchException)
        185 
        186     def _do_epoch_train(self):
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/learner.py in _with_events(self, f, event_type, ex, final)
        158 
        159     def _with_events(self, f, event_type, ex, final=noop):
    --> 160         try: self(f'before_{event_type}');  f()
        161         except ex: self(f'after_cancel_{event_type}')
        162         self(f'after_{event_type}');  final()
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/fastai/learner.py in _do_one_batch(self)
        167 
        168     def _do_one_batch(self):
    --> 169         self.pred = self.model(*self.xb)
        170         self('after_pred')
        171         if len(self.yb):
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        725             result = self._slow_forward(*input, **kwargs)
        726         else:
    --> 727             result = self.forward(*input, **kwargs)
        728         for hook in itertools.chain(
        729                 _global_forward_hooks.values(),
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/blurr/modeling/core.py in forward(self, x)
         41             if k not in self.hf_model_fwd_args: del x[k]
         42 
    ---> 43         return self.hf_model(**x,
         44                              output_hidden_states=self.output_hidden_states,
         45                              output_attentions=self.output_attentions,
    
    /opt/conda/envs/fastai/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
        725             result = self._slow_forward(*input, **kwargs)
        726         else:
    --> 727             result = self.forward(*input, **kwargs)
        728         for hook in itertools.chain(
        729                 _global_forward_hooks.values(),
    
    TypeError: forward() got an unexpected keyword argument 'output_hidden_states'
    
    opened by ncduy0303 7
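The first error says the checkpoint's classifier has 39 outputs while the new model expects 5. One generic PyTorch approach, sketched below with toy modules rather than the real BERT checkpoint, is to copy only the tensors whose names and shapes match and leave the new head freshly initialised. `load_matching_weights` is a hypothetical helper, not part of blurr or transformers. Recent transformers releases also accept `ignore_mismatched_sizes=True` in `from_pretrained`; whether the version used in this issue supports it, or whether `get_hf_objects` forwards it through `model_kwargs`, is not confirmed here.

```python
import torch
from torch import nn


def load_matching_weights(model, checkpoint_state):
    """Copy only the checkpoint tensors whose name and shape match `model`.

    Returns the names of the checkpoint entries that were skipped.
    """
    own_state = model.state_dict()
    matching = {
        name: tensor
        for name, tensor in checkpoint_state.items()
        if name in own_state and own_state[name].shape == tensor.shape
    }
    # strict=False lets the skipped parameters keep their fresh initialisation.
    model.load_state_dict(matching, strict=False)
    return sorted(set(checkpoint_state) - set(matching))


# Toy stand-ins: a shared "body" followed by a 39-label or a 5-label head.
pretrained = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 39))
target = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 5))

skipped = load_matching_weights(target, pretrained.state_dict())
print(skipped)
```

The second attempt fails for a different reason: the wrapper passes keyword arguments such as output_hidden_states to the wrapped module, and nn.Sequential's forward() does not accept them.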
  • TypeError: 'PosixPath' object is not iterable

    I'm trying to build a dataloader with data from folders and am facing this error:

    task = HF_TASKS_AUTO.SequenceClassification
    pretrained_model_name = "bert-base-uncased"
    hf_arch, hf_config, hf_tokenizer, hf_model = BLURR_MODEL_HELPER.get_hf_objects(pretrained_model_name, task=task)
    
    path = untar_data(URLs.IMDB)
    
    db = DataBlock(blocks=(HF_TextBlock(hf_arch=hf_arch, hf_tokenizer=hf_tokenizer), CategoryBlock), 
                   get_items=get_text_files,
                   splitter=GrandparentSplitter(train_name='train', valid_name='test'),
                   get_y=parent_label)
    
    db.summary(path)
    
    Setting-up type transforms pipelines
    Collecting items from /storage/data/imdb
    Found 100002 items
    2 datasets of sizes 25000,25000
    Setting up Pipeline: HF_TokenizerTransform
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-34-d2bf6908b967> in <module>
    ----> 1 db.summary(path)
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/fastai2/data/block.py in summary(self, source, bs, show_batch, **kwargs)
        152     "Steps through the transform pipeline for one batch, and optionally calls `show_batch(**kwargs)` on the transient `Dataloaders`."
        153     print(f"Setting-up type transforms pipelines")
    --> 154     dsets = self.datasets(source, verbose=True)
        155     print("\nBuilding one sample")
        156     for tl in dsets.train.tls:
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/fastai2/data/block.py in datasets(self, source, verbose)
        102         splits = (self.splitter or RandomSplitter())(items)
        103         pv(f"{len(splits)} datasets of sizes {','.join([str(len(s)) for s in splits])}", verbose)
    --> 104         return Datasets(items, tfms=self._combine_type_tfms(), splits=splits, dl_type=self.dl_type, n_inp=self.n_inp, verbose=verbose)
        105 
        106     def dataloaders(self, source, path='.', verbose=False, **kwargs):
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/fastai2/data/core.py in __init__(self, items, tfms, tls, n_inp, dl_type, **kwargs)
        283     def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
        284         super().__init__(dl_type=dl_type)
    --> 285         self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
        286         self.n_inp = ifnone(n_inp, max(1, len(self.tls)-1))
        287 
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/fastai2/data/core.py in <listcomp>(.0)
        283     def __init__(self, items=None, tfms=None, tls=None, n_inp=None, dl_type=None, **kwargs):
        284         super().__init__(dl_type=dl_type)
    --> 285         self.tls = L(tls if tls else [TfmdLists(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
        286         self.n_inp = ifnone(n_inp, max(1, len(self.tls)-1))
        287 
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/fastcore/foundation.py in __call__(cls, x, *args, **kwargs)
         45             return x
         46 
    ---> 47         res = super().__call__(*((x,) + args), **kwargs)
         48         res._newchk = 0
         49         return res
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/fastai2/data/core.py in __init__(self, items, tfms, use_list, do_setup, split_idx, train_setup, splits, types, verbose, dl_type)
        221         if do_setup:
        222             pv(f"Setting up {self.tfms}", verbose)
    --> 223             self.setup(train_setup=train_setup)
        224 
        225     def _new(self, items, split_idx=None, **kwargs):
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/fastai2/data/core.py in setup(self, train_setup)
        243             for f in self.tfms.fs:
        244                 self.types.append(getattr(f, 'input_types', type(x)))
    --> 245                 x = f(x)
        246             self.types.append(type(x))
        247         types = L(t if is_listy(t) else [t] for t in self.types).concat().unique()
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/fastcore/transform.py in __call__(self, x, **kwargs)
         70     @property
         71     def name(self): return getattr(self, '_name', _get_name(self))
    ---> 72     def __call__(self, x, **kwargs): return self._call('encodes', x, **kwargs)
         73     def decode  (self, x, **kwargs): return self._call('decodes', x, **kwargs)
         74     def __repr__(self): return f'{self.name}: {self.encodes} {self.decodes}'
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/fastcore/transform.py in _call(self, fn, x, split_idx, **kwargs)
         80     def _call(self, fn, x, split_idx=None, **kwargs):
         81         if split_idx!=self.split_idx and self.split_idx is not None: return x
    ---> 82         return self._do_call(getattr(self, fn), x, **kwargs)
         83 
         84     def _do_call(self, f, x, **kwargs):
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/fastcore/transform.py in _do_call(self, f, x, **kwargs)
         84     def _do_call(self, f, x, **kwargs):
         85         if not _is_tuple(x):
    ---> 86             return x if f is None else retain_type(f(x, **kwargs), x, f.returns_none(x))
         87         res = tuple(self._do_call(f, x_, **kwargs) for x_ in x)
         88         return retain_type(res, x)
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/fastcore/dispatch.py in __call__(self, *args, **kwargs)
         96         if not f: return args[0]
         97         if self.inst is not None: f = MethodType(f, self.inst)
    ---> 98         return f(*args, **kwargs)
         99 
        100     def __get__(self, inst, owner):
    
    /opt/conda/envs/fastai/lib/python3.7/site-packages/blurr/data/core.py in encodes(self, inp)
         30             toks = self.hf_tokenizer.tokenize(inp, add_prefix_space=self.add_prefix_space)
         31         else:
    ---> 32             toks = [sub_toks for entity in inp
         33                     for sub_toks in self.hf_tokenizer.tokenize(entity, add_prefix_space=self.add_prefix_space)]
         34 
    
    TypeError: 'PosixPath' object is not iterable
    
    opened by ncduy0303 7
  • Using fastai interpretation

    Hey there,

    First, thanks for your awesome tool; it makes everything work so much more smoothly. I successfully trained a model and want to evaluate its performance. I tried using interp = Interpretation.from_learner(learn), but this fails with an error:

    /usr/local/lib/python3.7/dist-packages/fastai/torch_core.py in nested_reorder(t, idxs)
        704     elif is_listy(t): return type(t)(nested_reorder(t_, idxs) for t_ in t)
        705     if t is None: return t
    --> 706     raise TypeError(f"Expected tensor, tuple, list or L but got {type(t)}")
        707 
        708 # Cell
    
    TypeError: Expected tensor, tuple, list or L but got <class 'dict'>
    

    Am I missing something, or is this an error on my side? Thanks in advance for your help.

    opened by hno2 6
  • Windows issue

    Hi. Is there a way to fix this?

    Collecting git+https://github.com/ohmeow/blurr.git
      Cloning https://github.com/ohmeow/blurr.git to c:\users\runner~1\appdata\local\temp\pip-req-build-gaye0t66
        ERROR: Command errored out with exit status 1:
         command: 'C:\Users\RUNNER~1\AppData\Local\R-MINI~1\envs\R-RETI~1\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\pip-req-build-gaye0t66\\setup.py'"'"'; __file__='"'"'C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\pip-req-build-gaye0t66\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\RUNNER~1\AppData\Local\Temp\pip-pip-egg-info-mpwpr2p4'
             cwd: C:\Users\RUNNER~1\AppData\Local\Temp\pip-req-build-gaye0t66\
        Complete output (7 lines):
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "C:\Users\RUNNER~1\AppData\Local\Temp\pip-req-build-gaye0t66\setup.py", line 42, in <module>
            long_description = open('README.md').read(),
          File "C:\Users\RUNNER~1\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\encodings\cp1252.py", line 23, in decode
            return codecs.charmap_decode(input,self.errors,decoding_table)[0]
        UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 13891: character maps to <undefined>
    

    Github Actions: https://github.com/henry090/fastai/pull/58/checks?check_run_id=1367643542

    opened by turgut090 6
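The traceback shows setup.py calling `open('README.md').read()` and Python decoding the file with cp1252, the default on that Windows runner. A likely remedy, assuming the README is stored as UTF-8, is to pass the encoding explicitly. The sketch below reproduces the failure and the workaround with a temporary file; it is not the repository's actual setup.py:

```python
import tempfile
from pathlib import Path

# Write a README containing a right double quotation mark (U+201D), whose
# UTF-8 encoding ends with the byte 0x9d.
readme = Path(tempfile.mkdtemp()) / "README.md"
readme.write_text("blurr \u201cquick start\u201d", encoding="utf-8")

# Decoding those bytes as cp1252 fails, as it does on the Windows runner.
try:
    readme.read_bytes().decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# Passing the encoding explicitly behaves the same on every platform.
with open(readme, encoding="utf-8") as f:
    long_description = f.read()

print(cp1252_ok, len(long_description))
```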
  • No module named 'nlp'

    I hit this when importing blurr after installing from master with pip install -e ".[dev]". In a notebook (ipynb), I ran:

    import torch
    from transformers import *
    from fastai2.text.all import *
    
    from blurr.data.all import *
    from blurr.modeling.all import *
    

    I get

    ---------------------------------------------------------------------------
    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-1-906f6e0a84bd> in <module>
          3 from fastai2.text.all import *
          4 
    ----> 5 from blurr.data.all import *
          6 from blurr.modeling.all import *
    
    ~/Documentos/github/blurr/blurr/data/all.py in <module>
          1 from ..utils import *
    ----> 2 from .core import *
          3 from .language_modeling import *
          4 from .question_answering import *
          5 from .token_classification import *
    
    ~/Documentos/github/blurr/blurr/data/core.py in <module>
          6 from functools import reduce
          7 
    ----> 8 import torch, nlp
          9 from transformers import *
         10 from fastai2.text.all import *
    
    ModuleNotFoundError: No module named 'nlp'
    
    opened by tyoc213 6
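The traceback shows blurr/data/core.py importing `nlp`, so that package has to be importable in the environment. A small standard-library check, with a hypothetical helper name, that lists which required modules are missing before the blurr imports run:

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; "nlp" is only present if it was installed.
print(missing_modules(["json", "nlp"]))
```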
  • Installing Blurr Breaks Kaggle Notebook Saving

    Installing blurr in a Kaggle notebook breaks saving the notebook due to incompatible nbconvert requirements. Kaggle requires nbconvert-6.1.0, but blurr installs nbconvert-5.6.1.

    This appears to be due to blurr requiring nbverbose as an install requirement. A possible fix on blurr's end could be moving nbverbose to dev_requirements until nbdev has its dependencies updated.

    The dependency tree, from pipdeptree:

    - ohmeow-blurr==0.1.0 [requires: nbverbose>=0.0.1] 
       - nbverbose==0.0.9 [requires: nbdev<2.0.0]
          - nbdev==1.1.23 [requires: nbconvert<6]
    
    opened by warner-benjamin 5
  • BlearnerForSummarization with t5-base errors out 'function' object has no attribute 'setup'

    I tried running the example code from the summarization page of the docs with the 't5-base' model, but it errors out. I have tried the latest release and master of both blurr and fastcore, but the issue persists. Here's the sample and the error it produces:

    learn = BlearnerForSummarization.from_data(
        cnndm_df,
        "t5-base",
        text_attr="article",
        summary_attr="highlights",
        max_length=256,
        max_target_length=130,
        dblock_splitter=RandomSplitter(),
        dl_kwargs={"bs": 2},
    ).to_fp16()
    

    The error I get:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Cell In [7], line 1
    ----> 1 learn = BlearnerForSummarization.from_data(
          2     cnndm_df,
          3     "t5-base",
          4     text_attr="article",
          5     summary_attr="highlights",
          6     max_length=256,
          7     max_target_length=130,
          8     dblock_splitter=RandomSplitter(),
          9     dl_kwargs={"bs": 2},
         10 ).to_fp16()
    
    File ~/mambaforge/envs/aranya/lib/python3.8/site-packages/blurr/text/modeling/seq2seq/summarization.py:146, in BlearnerForSummarization.from_data(cls, data, pretrained_model_name_or_path, text_attr, summary_attr, max_length, max_target_length, dblock_splitter, hf_tok_kwargs, text_gen_kwargs, dl_kwargs, learner_kwargs)
        143 get_y = ItemGetter(summary_attr)
        145 if hf_arch == "t5":
    --> 146     get_x.add(cls._add_t5_prefix)
        148 # define our DataBlock and DataLoaders
        149 batch_tokenize_tfm = Seq2SeqBatchTokenizeTransform(
        150     hf_arch,
        151     hf_config,
       (...)
        156     text_gen_kwargs=text_gen_kwargs,
        157 )
    
    File ~/mambaforge/envs/aranya/lib/python3.8/site-packages/fastcore/transform.py:204, in Pipeline.add(self, ts, items, train_setup)
        202 def add(self,ts, items=None, train_setup=False):
        203     if not is_listy(ts): ts=[ts]
    --> 204     for t in ts: t.setup(items, train_setup)
        205     self.fs+=ts
        206     self.fs = self.fs.sorted(key='order')
    
    AttributeError: 'function' object has no attribute 'setup'
    
    opened by SiddharthPant 1
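The traceback shows `Pipeline.add` calling `.setup()` on each item it is given, which a plain function does not have. One way around that, sketched here without fastcore, is to compose the prefix function with the getter before handing the result to the DataBlock. `operator.itemgetter` stands in for fastcore's `ItemGetter`, the helper names are hypothetical, and this is not necessarily how blurr resolved the issue:

```python
from operator import itemgetter


def add_t5_prefix(text, prefix="summarize: "):
    """Prepend the task prefix T5 expects, unless it is already there."""
    return text if text.startswith(prefix) else prefix + text


def compose(*funcs):
    """Return a function that applies `funcs` from left to right."""
    def composed(x):
        for f in funcs:
            x = f(x)
        return x
    return composed


# Read the "article" field, then add the prefix, as a single callable.
get_x = compose(itemgetter("article"), add_t5_prefix)
row = {"article": "Some long news story.", "highlights": "Short."}
print(get_x(row))
```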
  • Getting issue with dblock = DataBlock(     blocks=blocks,      get_x=ColReader(

    Getting issue with dblock = DataBlock( blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter() )

    Hi,

    I am getting an error with the block below. My code was running fine a week ago, but it has been failing for the last 3 days. I think it is due to a fastcore update; please check: dblock = DataBlock( blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter() )

    The following error is raised:


    AttributeError                            Traceback (most recent call last)
     in
          8 get_x=ColReader("text"),
          9 get_y=ColReader("label"),
    ---> 10 splitter=ColSplitter()
         11 )
         12

    2 frames
    /usr/local/lib/python3.7/dist-packages/fastcore/meta.py in _init(self, *args, **kwargs)
        148 if isinstance(arg,MethodType): arg = MethodType(arg.__func__, self)
        149 setattr(self, k, arg)
    --> 150 old_init(self, *args, **kwargs)
        151 functools.update_wrapper(_init, old_init)
        152 cls.__init__ = use_kwargs(cls._methods)(_init)

    /usr/local/lib/python3.7/dist-packages/fastai/data/block.py in __init__(self, blocks, dl_type, getters, n_inp, item_tfms, batch_tfms, **kwargs)
         97 if getattr(b, 'dl_type', None) is not None: self.dl_type = b.dl_type
         98 if dl_type is not None: self.dl_type = dl_type
    ---> 99 self.dataloaders = delegates(self.dl_type.__init__)(self.dataloaders)
        100 self.dls_kwargs = merge(*blocks.attrgot('dls_kwargs', {}))
        101

    /usr/local/lib/python3.7/dist-packages/fastcore/meta.py in _f(f)
        123 s2 = {k:v.replace(kind=inspect.Parameter.KEYWORD_ONLY) for k,v in inspect.signature(to_f).parameters.items()
        124 if v.default != inspect.Parameter.empty and k not in sigd and k not in but}
    --> 125 anno = {k:v for k,v in to_f.__annotations__.items() if k not in sigd and k not in but}
        126 sigd.update(s2)
        127 if keep: sigd['kwargs'] = k

    AttributeError: 'method-wrapper' object has no attribute '__annotations__'

    I am just running the code from https://pypi.org/project/ohmeow-blurr/0.0.7/, and it gives the same error. Here is the code snippet as well:

    import os, warnings

    import torch
    from transformers import *
    from transformers.utils import logging as hf_logging
    from fastai.text.all import *

    from blurr.text.data.all import *
    from blurr.text.modeling.all import *

    path = untar_data(URLs.IMDB_SAMPLE)

    model_path = Path("models")
    imdb_df = pd.read_csv(path / "texts.csv")

    n_labels = len(imdb_df["label"].unique())

    model_cls = AutoModelForSequenceClassification

    pretrained_model_name = "bert-base-uncased"

    config = AutoConfig.from_pretrained(pretrained_model_name)
    config.num_labels = n_labels

    hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=model_cls, config=config)

    # single input
    blocks = (TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model), CategoryBlock)
    dblock = DataBlock(blocks=blocks, get_x=ColReader("text"), get_y=ColReader("label"), splitter=ColSplitter())

    dls = dblock.dataloaders(imdb_df, bs=4)

    Note: the error is raised where dblock is created.

    opened by ranjan-sumit 1
  • Problem with the DataBlock creation for MLM

    I've been trying to do some masked language modelling (MLM) with blurr and Hugging Face transformers. When I get to the DataBlock creation, I get an error.

    It is a problem with fastai's DataBlock class and fastcore's delegates decorator.

    From what I understand, it is delegating the dataloaders creation, but it looks like the delegator expects an annotations attribute that doesn't exist. I am using Google Colab.

    opened by Dmaturana81 4
  • Multi-Label Classification Problem

    Hello, I'm new to DL and have just begun the Fastai course for 2022.

    I'm working on a multi-label classification problem and downloaded the Emotions dataset from Hugging Face. This dataset uses an array of label ids to represent the multiple labels for each example.

    I've looked at the example in the repository. There, you converted the labels to one-hot encoded values and then used ColReader to get the y variable.

    So, my question is: Do I have to use the same structure for my problem as well, or is there a different one I can use?

    opened by satish860 1
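If the one-hot layout from the repository example is wanted, an array of label ids can be converted with a few lines of plain Python. This is a sketch with a hypothetical helper name; the number of classes is an input you would take from the dataset's label list:

```python
def to_multi_hot(label_ids, n_classes):
    """Encode a list of class indices as a 0/1 vector of length n_classes."""
    vector = [0] * n_classes
    for idx in label_ids:
        if not 0 <= idx < n_classes:
            raise ValueError(f"label {idx} is outside [0, {n_classes})")
        vector[idx] = 1
    return vector


# Three examples with two labels, one label, and no label respectively.
rows = [[0, 2], [1], []]
encoded = [to_multi_hot(r, n_classes=4) for r in rows]
print(encoded)
```

fastai's MultiCategoryBlock can also consume list-valued labels directly when used with its default encoded=False; whether that combines cleanly with blurr's text block in a given version is something to verify against the docs.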
  • Out of memory errors with Keras models

    Hi, I'm trying to load a keras model saved as h5 by passing model_kwargs={"from_tf": True} to get_hf_objects. This consistently causes CUDA out of memory errors. I've tried reducing batch_size to 1, and downgraded pytorch to 1.9.0. Any ideas what could be going wrong? Thanks in advance.

    Traceback (most recent call last):
      File "/scratch/gh47/dg5608/pretraining-benefits/src/biobart_cnn_tf.py", line 52, in <module>
        model = BaseModelWrapper(hf_model)
      File "/scratch/gh47/dg5608/pt109/lib/python3.9/site-packages/fastcore/meta.py", line 39, in __call__
        res.__init__(*args,**kwargs)
      File "/scratch/gh47/dg5608/pt109/lib/python3.9/site-packages/blurr/text/modeling/core.py", line 58, in __init__
        self.hf_model = hf_model.cuda() if torch.cuda.is_available() else hf_model
      File "/apps/pytorch/1.9.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 637, in cuda
        return self._apply(lambda t: t.cuda(device))
      File "/apps/pytorch/1.9.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 530, in _apply
        module._apply(fn)
      File "/apps/pytorch/1.9.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 530, in _apply
        module._apply(fn)
      File "/apps/pytorch/1.9.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 530, in _apply
        module._apply(fn)
      [Previous line repeated 3 more times]
      File "/apps/pytorch/1.9.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 552, in _apply
        param_applied = fn(param)
      File "/apps/pytorch/1.9.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 637, in <lambda>
        return self._apply(lambda t: t.cuda(device))
    RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 31.75 GiB total capacity; 909.13 MiB already allocated; 1.94 MiB free; 920.00 MiB reserved in total by PyTorch)

    opened by dimagalat 0
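The numbers in the error message above are worth reading closely: the GPU reports 31.75 GiB in total and only 1.94 MiB free, while PyTorch itself has reserved just 920 MiB. The sketch below works through that arithmetic. The helper name `memory_outside_pytorch` is hypothetical, and it assumes that "free" in this message means free memory on the whole device rather than inside PyTorch's own pool.

```python
import re

# Conversion factors to MiB for the units that appear in the message.
UNITS_IN_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def mib(amount, unit):
    """Convert a (number, unit) pair from the message into MiB."""
    return float(amount) * UNITS_IN_MIB[unit]


def memory_outside_pytorch(message):
    """Return the MiB that are neither free nor reserved by this PyTorch process.

    Hypothetical helper: assumes 'free' is the device-wide free memory.
    """
    pattern = (
        r"([\d.]+) (KiB|MiB|GiB) total capacity.*?"
        r"([\d.]+) (KiB|MiB|GiB) free.*?"
        r"([\d.]+) (KiB|MiB|GiB) reserved"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("unrecognised CUDA out-of-memory message format")
    total, free, reserved = (
        mib(match.group(i), match.group(i + 1)) for i in (1, 3, 5)
    )
    return total - free - reserved


# The message is copied from the traceback in this issue.
message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB "
    "(GPU 0; 31.75 GiB total capacity; 909.13 MiB already allocated; "
    "1.94 MiB free; 920.00 MiB reserved in total by PyTorch)"
)
other_mib = memory_outside_pytorch(message)
print(f"{other_mib / 1024:.2f} GiB is held outside this PyTorch allocator")
```

Under that assumption, almost the entire card is occupied by something other than the PyTorch allocator, which would explain why reducing batch_size has no effect. One possible cause, not confirmed anywhere in the issue, is that TensorFlow (needed to read the h5 weights when from_tf is set) has pre-allocated GPU memory in the same process; another is a different job running on the same GPU. Checking nvidia-smi while the script runs would help tell these apart.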
  • How to save and load the pretrained model to local?

    How to save and load the pretrained model to local?

    I'm working on a text summarization task. How can I save the pretrained model locally and load it from disk, instead of downloading it again and again on Google Colab or a local PC?

    Below is the code I'm working on:

    pretrained_model_name = "facebook/bart-large-cnn"
    hf_arch, hf_config, hf_tokenizer, hf_model = get_hf_objects(pretrained_model_name, model_cls=BartForConditionalGeneration)
    
    hf_arch, type(hf_config), type(hf_tokenizer), type(hf_model)
    
    opened by nithinreddyy 0
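One way to approach the question above is with the `save_pretrained` and `from_pretrained` methods that Hugging Face models and tokenizers provide: download once, save to a directory, and load from that directory on later runs. The sketch below wraps that pattern in a helper. The name `load_or_download` and the directory layout are hypothetical, and the transformers calls are left commented out so the sketch itself has no heavy dependencies.

```python
from pathlib import Path


def load_or_download(name, local_dir, from_pretrained):
    """Load from `local_dir` when a saved copy exists; otherwise fetch `name` once and save it.

    `from_pretrained` is any loader that accepts a hub name or a directory path
    and returns an object with a `save_pretrained(path)` method.
    """
    local_dir = Path(local_dir)
    if local_dir.is_dir() and any(local_dir.iterdir()):
        # A saved copy is already on disk: reuse it.
        return from_pretrained(str(local_dir))
    # First run: download from the hub, then keep a copy for next time.
    obj = from_pretrained(name)
    local_dir.mkdir(parents=True, exist_ok=True)
    obj.save_pretrained(str(local_dir))
    return obj


# Intended use with transformers (assumed paths, shown as comments only):
# from transformers import AutoTokenizer, BartForConditionalGeneration
# name = "facebook/bart-large-cnn"
# hf_model = load_or_download(
#     name, "models/bart-large-cnn/model", BartForConditionalGeneration.from_pretrained
# )
# hf_tokenizer = load_or_download(
#     name, "models/bart-large-cnn/tokenizer", AutoTokenizer.from_pretrained
# )
```

The model and tokenizer use separate directories here because the helper treats any non-empty directory as a complete saved copy. On Google Colab the directory would need to live somewhere persistent, such as a mounted Google Drive folder, since the runtime's disk is reset between sessions. Whether get_hf_objects accepts such a local directory in place of the model name has not been verified here.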
Releases(1.0.0)
  • 1.0.0(Apr 12, 2022)

    The official v.1 release of ohmeow-blurr

    This is a massive refactoring over the previous iterations of blurr, including namespace modifications that will make it easier for us to add support for vision, audio, and other transformers in the future. If you've used any of the previous versions of blurr or the development build we covered in part 2 of the W&B study group, please make sure you read the docs and note the namespace changes.

    To get up to speed with how to use this library, check out the W&B x fastai x Hugging Face study group. The docs are your friend and full of examples as well. I'll be working on updating the other examples floating around the internet as I have time.

    If you have any questions, please use the hf-fastai channel in the fastai Discord or GitHub issues. As always, any and all PRs are welcome.

  • 0.0.26(May 4, 2021)

    Check out the README for more info.

    This release fixes a couple of issues and also includes a few breaking changes. Make sure you update your version of fastai to >= 2.3.1 and your huggingface transformers to >= 4.5.x

  • 0.0.22(Jan 1, 2021)

    • Updated the Seq2Seq models to use some of the latest huggingface bits like tokenizer.prepare_seq2seq_batch.
    • Separated out the Seq2Seq and Token Classification metrics into metrics-specific callbacks for a better separation of concerns. As a best practice, you should now pass them as callbacks to fit_one_cycle and similar methods rather than attaching them to your Learner.
    • NEW: Translation is now available in blurr, joining causal language modeling and summarization in our core Seq2Seq stack
    • NEW: Integration of huggingface's Seq2Seq metrics (rouge, bertscore, meteor, bleu, and sacrebleu). Plenty of info on how to set this up in the docs.
    • NEW: Added default_text_gen_kwargs, a method that given a huggingface config, model, and task (optional), will return the default/recommended kwargs for any text generation models.
    • A lot of code cleanup (e.g., refactored naming and removal of redundant code into classes/methods)
    • More model support and more tests across the board! Check out the docs for more info
    • Misc. validation improvements and bug fixes.

    See the docs for each task for more info!

  • 0.0.16(Nov 4, 2020)

  • 0.0.14(Sep 25, 2020)

    This release simplifies the API and introduces a new on-the-fly tokenization feature whereby all tokenization happens during mini-batch creation. There are several upsides to this approach. First, it gets you training faster. Second, it reduces RAM utilization while your raw data is being read (especially nice with very large datasets that would give folks problems on platforms like Colab). And lastly, I believe the approach provides some flexibility to include data augmentation and/or build adversarial models, among other things.
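The idea of tokenizing during mini-batch creation can be illustrated with a small, dependency-free sketch. This is not blurr's implementation: `toy_tokenize` is a hypothetical stand-in for a real Hugging Face tokenizer, and `batches` is a hypothetical stand-in for a DataLoader. It shows only that raw strings stay untokenized until their batch is requested, and that padding is sized to each batch.

```python
def toy_tokenize(text, vocab):
    """Hypothetical tokenizer: lowercase whitespace split, ids assigned on first sight.

    Id 0 is reserved for padding, so real tokens start at 1.
    """
    return [vocab.setdefault(tok, len(vocab) + 1) for tok in text.lower().split()]


def batches(texts, batch_size, tokenize, pad_id=0):
    """Yield padded batches, tokenizing each batch only when it is requested."""
    for start in range(0, len(texts), batch_size):
        ids = [tokenize(t) for t in texts[start:start + batch_size]]
        width = max(len(seq) for seq in ids)  # pad to the longest item in this batch only
        yield [seq + [pad_id] * (width - len(seq)) for seq in ids]


vocab = {}
raw_texts = ["a great movie", "terrible", "not bad at all"]
loader = batches(raw_texts, 2, lambda t: toy_tokenize(t, vocab))

first_batch = next(loader)          # only the first two texts are tokenized here
seen_after_first = len(vocab)       # the third text has not been touched yet
second_batch = next(loader)
```

Because nothing is tokenized up front, start-up cost and memory use stay proportional to the raw text, and a transformation such as text augmentation could be applied to each string just before it is tokenized.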

  • 0.0.12(Sep 16, 2020)

code for "AttentiveNAS Improving Neural Architecture Search via Attentive Sampling"

AttentiveNAS: Improving Neural Architecture Search via Attentive Sampling This repository contains PyTorch evaluation code, training code and pretrain

Facebook Research 94 Oct 26, 2022
Extract city and country mentions from text, like GeoText, but using FlashText (an Aho-Corasick implementation) instead of regex.

flashgeotext ⚡ 🌍 Extract and count countries and cities (+their synonyms) from text, like GeoText on steroids using FlashText, an Aho-Corasick impleme

Ben 57 Dec 16, 2022
Helps you discover excellent English-language projects without being distracted by projects in other spoken languages

GitHub English Top Charts 「Help you discover excellent English projects and get

GrowingGit 544 Jan 09, 2023
A script that automatically creates a branch name using google translation api and jira api

About A script that automatically creates branch names using the Google Translation API and the Jira API. Setup The following three environment variables must be registered. JIRA_USER : JIRA email (ex: hyunwook.kim

2 Dec 20, 2021

Python functions for summarizing and improving voice dictation input.

Helpmespeak Help me speak uses Python functions for summarizing and improving voice dictation input. Get started with OpenAI GPT-3 OpenAI is an amazing

Margarita Humanitarian Foundation 6 Dec 17, 2022
This repository contains (not all) code from my project on Named Entity Recognition in philosophical text

NERphilosophy 👋 Welcome to the GitHub repository of my BSc thesis. This repository contains (not all) code from my project on Named Entity Recognitio

Ruben 1 Jan 27, 2022
A Plover python dictionary allowing for consistent symbol input with specification of attachment and capitalisation in one stroke.

Emily's Symbol Dictionary Design This dictionary was created with the following goals in mind: Have a consistent method to type (pretty much) every sy

Emily 68 Jan 07, 2023
Idea is to build a model which will take keywords as inputs and generate sentences as outputs.

keytotext Idea is to build a model which will take keywords as inputs and generate sentences as outputs. Potential use case can include: Marketing Sea

Gagan Bhatia 364 Jan 03, 2023
Implementation of the Hybrid Perception Block and Dual-Pruned Self-Attention block from the ITTR paper for Image to Image Translation using Transformers

ITTR - Pytorch Implementation of the Hybrid Perception Block (HPB) and Dual-Pruned Self-Attention (DPSA) block from the ITTR paper for Image to Image

Phil Wang 17 Dec 23, 2022
Installation, test and evaluation of Scribosermo speech-to-text engine

Scribosermo STT Setup Scribosermo is a LGPL licensed, open-source speech recognition engine to "Train fast Speech-to-Text networks in different langua

Florian Quirin 3 Jun 20, 2022
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts

gpt-2-simple A simple Python package that wraps existing model fine-tuning and generation scripts for OpenAI's GPT-2 text generation model (specifical

Max Woolf 3.1k Jan 07, 2023
kochat

Kochat Not satisfied with chatbot builders and want to make your own deep learning chatbot application? With Kochat, you can easily build your own deep learning chatbot application. # 1. Create the dataset object dataset = Dataset(ood=True) #

1 Oct 25, 2021
A tool that lets you detect and translate text.

Text detection and recognition This repository contains a tool that detects regions with text and translates them one by one. Description Two pretr

Damian Panek 176 Nov 28, 2022
Predicting the usefulness of reviews given the review text and metadata surrounding the reviews.

Predicting Yelp Review Quality Table of Contents Introduction Motivation Goal and Central Questions The Data Data Storage and ETL EDA Data Pipeline Da

Jeff Johannsen 3 Nov 27, 2022
This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular intervals. It sends out the most recent news at random!

Nepali-news-notifier This script just scrapes the most recent Nepali news from Kathmandu Post and notifies the user about current events at regular in

Sachit Yadav 1 Feb 11, 2022
Long text token classification using LongFormer

Long text token classification using LongFormer

abhishek thakur 161 Aug 07, 2022
ChessCoach is a neural network-based chess engine capable of natural-language commentary.

ChessCoach is a neural network-based chess engine capable of natural-language commentary.

Chris Butner 380 Dec 03, 2022
Spert NLP Relation Extraction API deployed with torchserve for inference

URLMask Python program for Linux users to change a URL to ANY domain. A program that can take any url and mask it to any domain name you like. E.g. ne

Zichu Chen 1 Nov 24, 2021
Reformer, the efficient Transformer, in Pytorch

Reformer, the Efficient Transformer, in Pytorch This is a Pytorch implementation of Reformer https://openreview.net/pdf?id=rkgNKkHtvB It includes LSH

Phil Wang 1.8k Dec 30, 2022
Russian words synonyms and antonyms

ru_synonyms Russian words synonyms and antonyms. Install pip install git+https://github.com/ahmados/rusynonyms.git Usage from ru_synonyms import Anto

sumekenov 7 Dec 14, 2022