A knowledge base construction engine for richly formatted data

Overview

Fonduer

Fonduer is a Python package and framework for building knowledge base construction (KBC) applications from richly formatted data.

Note that Fonduer is still actively under development, so feedback and contributions are welcome. Submit bugs in the Issues section or feel free to submit your contributions as a pull request.

Getting Started

Check out our Getting Started Guide to get up and running with Fonduer.

Learning how to use Fonduer

The Fonduer tutorials cover the Fonduer workflow, showing how to extract relations from hardware datasheets and scientific literature.

Reference

Fonduer: Knowledge Base Construction from Richly Formatted Data (blog):

@inproceedings{wu2018fonduer,
  title={Fonduer: Knowledge Base Construction from Richly Formatted Data},
  author={Wu, Sen and Hsiao, Luke and Cheng, Xiao and Hancock, Braden and Rekatsinas, Theodoros and Levis, Philip and R{\'e}, Christopher},
  booktitle={Proceedings of the 2018 International Conference on Management of Data},
  pages={1301--1316},
  year={2018},
  organization={ACM}
}

Acknowledgements

Fonduer leverages the work of Emmental and Snorkel.

Comments
  • Using candidates for prediction (Fonduer Prediction Pipeline)

    Scenario:

    For my use case I have a set of financial documents.

    The entire document set is divided into train, dev, and test. The documents are parsed, and the mentions and candidates are extracted with some rules.

    The featurized training candidates are used to train a Fonduer learning model, and the model is used to predict on the test candidates, following the normal Fonduer pipeline as demonstrated in the hardware tutorial.

    Problems & Questions

    1. Is the Fonduer prediction pipeline production ready? How can we fine-tune it to achieve better accuracy? Should the main focus be on the quality of the extracted mentions?

    With my initial analysis and usage following the hardware tutorial, I could not obtain good results.

    2. Can we separate the training and test pipelines?

    In the current scenario, when I feed in a new document for prediction, the entire corpus has to be parsed again to extract the mentions and candidates and to store the feature keys.

    Please correct me if that is not the case, and help me with a snippet that shows how to separate the two.

    opened by atulgupta9 16
  • Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537

    Description of the problems or issues

    Is your pull request related to a problem? Please describe. See #534. This request redoes #537, which first required fixing #538 (fixed by #539).

    Does your pull request fix any issue. See #534

    Description of the proposed changes

    For multi-line Japanese strings such as 'AAAA\nBBBB', spacy[ja] sometimes generates tokens that span the line break, e.g. ['AAA', 'AB', 'B', 'BB']. This proposal defines the bbox of 'AB' as that of a multi-line word (i.e., left is the min left of ['A', 'B'], top is the top of 'A', right is the max right of ['A', 'B'], and bottom is the bottom of 'B').
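The bounding-box rule described above can be sketched in a few lines. This is only an illustration, not the code from the pull request: the `Bbox` tuple and `merge_multiline_bbox` helper are hypothetical names, and the coordinates are made up, assuming an origin at the top-left of the page.

```python
from typing import List, NamedTuple


class Bbox(NamedTuple):
    """Hypothetical bounding box in page coordinates (origin at top-left)."""

    left: int
    top: int
    right: int
    bottom: int


def merge_multiline_bbox(parts: List[Bbox]) -> Bbox:
    """Merge the per-line boxes of one token, ordered from first to last line."""
    if not parts:
        raise ValueError("need at least one box")
    return Bbox(
        left=min(p.left for p in parts),    # min left over all parts
        top=parts[0].top,                   # top of the first part ('A')
        right=max(p.right for p in parts),  # max right over all parts
        bottom=parts[-1].bottom,            # bottom of the last part ('B')
    )


# 'A' ends the first line and 'B' starts the second line (made-up coordinates).
a = Bbox(left=90, top=10, right=100, bottom=20)
b = Bbox(left=10, top=25, right=20, bottom=35)
merged = merge_multiline_bbox([a, b])
print(merged)
```

The merged box covers both line fragments, so it is wider than either character box and spans two text lines.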

    Test plan

    This is caused by Japanese morphological analysis, so I have added Japanese test data to 'tests/data/hocr_simple/japan.hocr' and test code to 'tests/parser/test_parser.py::test_parse_hocr'.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.
    opened by YasushiMiyata 15
  • parser.apply does not return for a long time even though the progress bar indicates it finishes parsing

    Description of the bug

    This is not a bug, but a performance issue. It is not noticeable when parsing a small number of documents, but with many documents parser.apply does not return even though the progress bar indicated that parsing finished a long time ago (1 hour or more).

    To Reproduce

    Steps to reproduce the behavior:

    1. Parse many documents (my case: ~2500)

    Expected behavior

    parser.apply returns when the progress bar indicates it finished parsing all the documents.

    Environment (please complete the following information)

    • OS: Debian Buster
    • PostgreSQL Version: 12.1
    • Poppler Utils Version: N/A
    • Fonduer Version: 0.8.3+dev (01e0d9319b523aff7aa7f5c583a9f330b0705ecc)

    bug 
    opened by HiromuHota 14
  • Execute preprocessing and parsing in parallel

    Description of the problems or issues

    Is your pull request related to a problem? Please describe.

    Currently, the preprocessor and parser are executed in completely sequential order, i.e., preprocess N docs (and load them into a queue), then parse N docs. This has two drawbacks:

    1. the progress bar shows nothing during preprocessing.
    2. the machine RAM has to be large enough to hold N preprocessed docs at a time.

    They become more serious when N is large and/or each HTML file is large.

    Does your pull request fix any issue.

    Fix #435

    Description of the proposed changes

    This PR

    • places a cap on the in_queue so that only a certain number of documents are loaded to in_queue.
    • executes preprocessor and parser in parallel (i.e., the main process does preprocessing and child process(es) do parsing in parallel).
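The two changes listed above follow the usual bounded producer/consumer pattern. The sketch below is an assumption-laden illustration rather than Fonduer's implementation: a thread stands in for the child process to keep the example short, and `preprocess`, `parse`, `run`, and the `cap` parameter are hypothetical names.

```python
import queue
import threading
from typing import List

SENTINEL = None  # marks the end of the input stream


def preprocess(n_docs: int, in_queue: queue.Queue) -> None:
    """Producer: put() blocks while the queue already holds `cap` documents."""
    for i in range(n_docs):
        in_queue.put(f"doc-{i}")
    in_queue.put(SENTINEL)


def parse(in_queue: queue.Queue, parsed: List[str]) -> None:
    """Consumer: starts parsing as soon as the first document arrives."""
    while True:
        doc = in_queue.get()
        if doc is SENTINEL:
            break
        parsed.append(doc.upper())  # stand-in for real parsing work


def run(n_docs: int, cap: int) -> List[str]:
    in_queue: queue.Queue = queue.Queue(maxsize=cap)  # the cap on in_queue
    parsed: List[str] = []
    worker = threading.Thread(target=parse, args=(in_queue, parsed))
    worker.start()
    preprocess(n_docs, in_queue)  # the main thread does the preprocessing
    worker.join()
    return parsed


result = run(n_docs=10, cap=2)
print(result)
```

Because the queue is bounded, at most `cap` preprocessed documents wait in memory at any time, and progress can be reported as soon as the first document has been parsed.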

    Test plan

    For the 1st issue: I manually checked that the progress bar starts showing progress right after parser.apply is started.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [ ] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.
    enhancement 
    opened by HiromuHota 13
  • [Errno 32] Broken pipe for Parser in parallel execution on OSX

    Hi,

    In fonduer-tutorials, after running cell:

    corpus_parser = OmniParser(structural=True, lingual=True, visual=True, pdf_path=pdf_path)
    %time corpus_parser.apply(doc_preprocessor, parallelism=PARALLEL)
    

    whenever PARALLEL is smaller than max_docs, I get:

    Traceback (most recent call last):
      File "/anaconda3/lib/python3.6/multiprocessing/queues.py", line 240, in _feed
        send_bytes(obj)
      File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 200, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 398, in _send_bytes
        self._send(buf)
      File "/anaconda3/lib/python3.6/multiprocessing/connection.py", line 368, in _send
        n = write(self._handle, buf)
    BrokenPipeError: [Errno 32] Broken pipe
    

    Otherwise (with PARALLEL greater than or equal to max_docs), the result is empty tables in PostgreSQL. With parallelization turned off, it works.

    Best regards

    bug 
    opened by mladvladimir 13
  • Feat/multary candidates

    Description of the problems or issues

    The feature extraction only supports unary and binary candidates.

    Does your pull request fix any issue. Closes #455

    Description of the proposed changes

    Add new functions that support multary relations between spans in feature extraction.

    Test plan

    Use a candidate with more than two mentions and run the feature extraction step.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.

    Note:

    For textual features to work with multary candidates, we need a new version of treedlib based on this PR: treedlib#46. If you can contact its maintainers, please do.

    Also, I can't get the coverage of tabular_features up, so help improving it is welcome.

    enhancement 
    opened by wajdikhattel 12
  • Add HOCRDocPreprocessor and HocrVisualParser

    Description of the problems or issues

    Is your pull request related to a problem? Please describe.

    This is the second patch, following #518.

    Does your pull request fix any issue.

    N/A.

    Description of the proposed changes

    Add HOCRDocPreprocessor and HocrVisualParser.

    Test plan

    I added a few real hOCR example files.

    Checklist

    • [x] I have updated the documentation accordingly.
    • [x] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    • [x] I have updated the CHANGELOG.rst accordingly.
    enhancement 
    opened by HiromuHota 9
  • Duplicate key error while adding two mentions which are same

    Suppose I have two mention classes (for example, zip code and tax code) whose matchers both return true for the same span in a document (both check a 5-digit regex match). In that case Fonduer seems to throw the error below. Please help me resolve this.

    
    sqlalchemy.exc.IntegrityError: (psycopg2.errors.UniqueViolation) duplicate key value violates unique constraint "context_stable_id_key"
    DETAIL:  Key (stable_id)=(1443208965_10_subset::span_mention:23313:23321) already exists.
    
    [SQL: INSERT INTO context (type, stable_id) VALUES (%(type)s, %(stable_id)s) RETURNING context.id]
    
    opened by saikalyan9981 9
  • unable to read images in the pdf file

    Hi

    I am passing HTML to Fonduer, and it reports that it is unable to read images from figures. I took a PDF, converted it to HTML via pdftotree, and passed the HTML to Fonduer. Is this an issue with pdftotree not being able to render images? I want to know the mechanism for linking or embedding images in the HTML so that Fonduer can read them.

    Please help or advise, as I am stuck on this issue.

    opened by ashleo25 8
  • Non-deterministic behavior in featurization

    Describe the bug

    When working with a large corpus (~7k docs) of hardware datasheets and extracting multiple relations, we expect the features for each candidate to be deterministic between runs, even more so with parallelism=1 set in the Featurizer. However, we find that there can be a small number (e.g., < 5) of differences between feature tables, resulting in slightly different sparse matrices and thus slightly different results.

    To Reproduce

    Running on the HACK transistor dataset will reproduce the error. However, it takes a long time, and we haven't yet found a minimal example that reproduces it. Attached are two feature table dumps from two different runs with parallelism=1. Note that there is only a single difference, on line 65454.

    feature_table.tar.gz

    Note that it isn't always one difference, and the difference is not deterministic. The attached difference is just an example.

    Expected behavior

    We would expect these feature tables to be identical between runs.

    Error Logs/Screenshots

    For convenience, here is the differing line in screenshot form: [image]

    Additional context

    If the issue is in the UDF implementation, this might affect the Labeler in addition to the Featurizer, since they share a lot of the UDF code.

    bug 
    opened by lukehsiao 8
  • Type hints

    Is your feature request related to a problem? Please describe.

    I'm always frustrated when I have to look at the source code to check the types of arguments and return values.

    Describe the solution you'd like

    1. Type hints (PEP 484) are added to the source code, for example:
    def greeting(name: str) -> str:
        return 'Hello ' + name
    
    2. (Eventually) type checking is enforced during pre-commit, for example with flake8-mypy.

    Describe alternatives you've considered

    Depending on the editor (PyCharm, etc.), type/rtype documentation like the example below gives you type hinting. However, I'm not sure this is equivalent to type hints (PEP 484).

    def greeting(name):
        """
        greeting
    
        :param name: description
        :type name: type description
        :return: description
        :rtype: type description
        """
        return 'Hello ' + name
    

    enhancement help wanted 
    opened by HiromuHota 8
  • CandidateExtractor doesn't scale for larger relations

    Hello, thanks for providing this framework. My group has run into a bit of a snag:

    For context, we've successfully completed candidate extraction & labeling for binary relations, with reasonable runtimes. With parallelism = 6, candidate extraction takes ~2 minutes per document.

    We've since moved on to a 3-ary relation that is very similar to the binary relation. This 3-ary relation shares some mentions with the binary relation and uses a very similar candidate extractor. We have done performance testing for the 3-ary throttler function and found it to have a very similar runtime to the binary throttler. Candidate extraction now takes 4 hours per document. This immense slowdown occurs because the number of temporary candidates is the product of the mention counts, so the cost grows exponentially with the number of entities in a relation.

    Here are some numbers from our use case:

    • Mention A: 800 mentions found
    • Mention B: 140 mentions found
    • Mention C: 150 mentions found

    If our relation only includes (A,B), we have a total of 800*140 = 112,000 temporary candidates to evaluate with our throttler. Should we add mention C to form the relation (A,B,C), our total now grows to 800*140*150 = 16.8 million temporary candidates. We're unable to narrow our mention matchers further without excluding true positives.
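The arithmetic above can be reproduced with a short script. It assumes, as the issue describes, that the temporary candidates are the full Cartesian product of the mention sets before the throttler is applied; `temp_candidates` is a hypothetical helper, not a Fonduer function.

```python
from math import prod
from typing import Dict, Sequence

# Mention counts reported in the issue.
mention_counts: Dict[str, int] = {"A": 800, "B": 140, "C": 150}


def temp_candidates(mention_types: Sequence[str]) -> int:
    """Number of temporary candidates: the product of the mention counts."""
    return prod(mention_counts[m] for m in mention_types)


binary = temp_candidates(["A", "B"])
ternary = temp_candidates(["A", "B", "C"])
print(binary, ternary, ternary // binary)  # 112000 16800000 150
```

Adding mention C multiplies the throttler workload by 150, which is consistent with the jump from minutes to hours per document reported above.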

    This slowdown makes the Fonduer framework effectively unusable for any large-scale use case that requires relations with more than 2 entities. Can you provide guidance to address this issue?

    opened by robbieculkin 1
  • Tables aren't redefined for re-runs of UDF apply

    Description of the bug

    As part of iterative development in a Jupyter environment, apply may be re-run several times. The developer might need to update candidates or create a new labeling function, for example. When this happens, the corresponding Postgres table is cleared but not dropped. This means that the definition of the table cannot change to accommodate the updated parameters for apply.

    To Reproduce

    Steps to reproduce the behavior:

    1. Run the max_storage_temp_tutorial notebook in fonduer-tutorials, up to and including the Labeling Functions section.
    2. Add a new LF; it doesn't need to do anything in particular (it could return ABSTAIN every time). Add this to the stg_temp_lfs list.
    3. Re-run the remainder of cells in the section.

    Upon calling LFAnalysis, the following exception is thrown:

    ValueError: Number of LFs (7) and number of LF matrix columns (6) are different
    

    Expected behavior

    Underlying tables for a re-run of a UDF apply method should not only be cleared, but dropped.
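The difference between clearing and dropping a table can be shown with a small stand-in. This sketch uses an in-memory SQLite database instead of PostgreSQL, and the `label` table with one column per LF is a hypothetical schema chosen for illustration; it is not Fonduer's actual schema.

```python
import sqlite3
from typing import List


def column_names(conn: sqlite3.Connection, table: str) -> List[str]:
    """Return the column names of a table (second field of PRAGMA table_info)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE label (candidate_id INTEGER, lf_1 INTEGER, lf_2 INTEGER)")
conn.execute("INSERT INTO label VALUES (1, 0, -1)")

# Clearing removes the rows but keeps the old table definition.
conn.execute("DELETE FROM label")
cleared = column_names(conn, "label")

# Dropping allows the table to be recreated with a column for the new LF.
conn.execute("DROP TABLE label")
conn.execute(
    "CREATE TABLE label "
    "(candidate_id INTEGER, lf_1 INTEGER, lf_2 INTEGER, lf_3 INTEGER)"
)
recreated = column_names(conn, "label")
conn.close()

print(cleared)
print(recreated)
```

After the `DELETE`, the table still has the original columns; only after the `DROP` and re-create does the definition reflect the added LF.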

    Error Logs/Screenshots

    Full stack trace:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-62-e005feee6300> in <module>
          5 sorted_lfs = sorted(lfs, key=lambda lf: lf.name)
          6 
    ----> 7 LFAnalysis(L=L_train[0], lfs=sorted_lfs).lf_summary(Y=L_gold_train[0].reshape(-1))
    
    ~/.venv/lib/python3.7/site-packages/snorkel/labeling/analysis.py in __init__(self, L, lfs)
         44             if len(lfs) != self._L_sparse.shape[1]:
         45                 raise ValueError(
    ---> 46                     f"Number of LFs ({len(lfs)}) and number of "
         47                     f"LF matrix columns ({self._L_sparse.shape[1]}) are different"
         48                 )
    
    ValueError: Number of LFs (7) and number of LF matrix columns (6) are different
    

    Environment (please complete the following information)

    • OS: Ubuntu 18.04
    • PostgreSQL Version: 12.1
    • Poppler Utils Version: 0.71.0-5
    • Fonduer Version: 0.8.3

    Additional context

    https://github.com/HazyResearch/fonduer/issues/263#issuecomment-527588765 advises restarting Python, but this does not appear to solve the problem.

    opened by robbieculkin 5
  • Parser is not splitting multiple lines sentences properly

    Description of the bug

    I'm trying to train a model that can build a knowledge base from the OPC UA Companion Specifications as part of my thesis. I have the dataset as PDFs and used a third-party program to convert them into HTML, trying my best to preserve the data structure information (I get the same result even if I parse the PDFs alone).

    Then I followed the hardware_fonduer_model tutorial to extract the candidates accordingly.

    The problem is that the parser splits sentences wrongly: it treats the end of a line as the end of a sentence. I tried to debug using SimpleParser.split_sentences(text), and it turned out that Python needs a backslash to split a statement into multiple lines.

    So I thought I could use the replacements=['[\n]', ' '] parameter so that the split would work better, but I get ValueError: too many values to unpack (expected 2). What is the default configuration for sentence segmentation?
    How can I get multiple sentences as one mention? (I tried MentionNgrams up to n_max=100 and still get just one.)
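One plausible reading of the ValueError above is that the replacements are consumed as (pattern, replacement) pairs, so a flat list of two strings fails when its first element is unpacked. The sketch below reproduces that behavior in plain Python; `apply_replacements` is a hypothetical helper written under that assumption, not Fonduer's code.

```python
import re
from typing import List, Optional, Tuple


def apply_replacements(text: str, replacements: List[Tuple[str, str]]) -> str:
    """Apply each (pattern, replacement) pair in order."""
    for pattern, repl in replacements:  # every item must unpack into two values
        text = re.sub(pattern, repl, text)
    return text


text = "generated by this move command\nrequest."

# A list of pairs works: the newline is replaced with a space.
fixed = apply_replacements(text, [("[\n]", " ")])

# A flat list fails: the 3-character string "[\n]" cannot unpack into 2 names.
error: Optional[str] = None
try:
    apply_replacements(text, ["[\n]", " "])  # type: ignore[list-item]
except ValueError as exc:
    error = str(exc)

print(repr(fixed))
print(error)
```

Under this assumption, passing `replacements=[('[\n]', ' ')]` (a list containing one tuple) would avoid the unpacking error.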

    I would really appreciate hearing back from you.

    Many thanks in advance.

    Example: Text to be parsed

    Boolean indicating if a profile /signature should be generated by this move command request.If the optional VariableSignatureRequestStatus is not provided on the Object, this parameter is ignored by the Server.

    Expected behavior

    sentence 1: Boolean indicating if a profile /signature should be generated by this move command request.
    sentence 2: If the optional VariableSignatureRequestStatus is not provided on the Object, this parameter is ignored by the Server.

    Actual behavior

    sentence 1: Boolean indicating if a profile /signature should be generated by this move command
    sentence 2: request.
    sentence 3: request.If the optional VariableSignatureRequestStatus is not provided on the Object, this
    sentence 4: parameter is ignored by the Server.

    Environment

    opened by eng-khaled1 3
  • Suggestion required: Getting error while applying Featurizer

    @SenWu @HiromuHota, can you please suggest whether my reasoning is right?

    I am getting this error:

    File "abcd./anaconda3/lib/python3.7/site-packages/fonduer/utils/data_model_utils/structural.py", line 55, in _get_node
        return doc_etree.xpath(sentence.xpath)[0]
    IndexError: list index out of range

    I am following the hardware tutorial on some email HTML messages and getting a mention count of nearly 4000.

    Also, the following steps returned output:

    train_cands = candidate_extractor.get_candidates(split=0)
    dev_cands = candidate_extractor.get_candidates(split=1)
    test_cands = candidate_extractor.get_candidates(split=2)

    But on applying the featurizer:

    featurizer.apply(split=0, train=True, parallelism=PARALLEL)

    I get the error shown above.

    I looked on Stack Overflow, but the usual cause, an HTML syntax issue, does not seem to apply because the HTML renders fine in a browser. Can you share your thoughts on whether:

    1. it could be because no candidates are being generated, or
    2. it is something else?

    Thanks.

    opened by AshutoshUpadhya 3
  • How can i extract a paragraph and all associated sentences in document

    How can I extract a paragraph and all of its associated sentences from a document? Basically, I need paragraphs with their associated sentences. @lukehsiao @SenWu @vincentschen @ZZWENG @stephenbach

    Appreciate your help

    needs-info 
    opened by ashleo25 1
  • Featurizer.get_keys() does not honor candidate classes in context

    Description of the bug

    Unlike other methods (e.g., Featurizer.drop_keys() and Featurizer.upsert_keys()), Featurizer.get_keys() does not honor the candidate classes in context but returns all feature keys, no matter which candidate class they are associated with. This is confusing.

    See https://github.com/HazyResearch/fonduer/issues/511#issuecomment-696618392 for how this actually confused a user.

    To Reproduce

    This is a design error.

    Expected behavior

    These methods should behave similarly. Either

    • None of these honor candidate classes, or
    • All of these honor them.

    Error Logs/Screenshots

    N/A

    Environment (please complete the following information)

    • Fonduer Version: 0.8.3

    opened by HiromuHota 0
Releases(v0.9.0)
  • v0.9.0(Jun 23, 2021)

    0.9.0 - 2021-06-22

    This is a long-awaited release with some performance improvements and some breaking changes. See the changelog for details.

    Added

    Changed

    • @HiromuHota: Renamed VisualLinker to PdfVisualParser, which assumes the following: (#518)

      • pdf_path should be a directory path, where PDF files exist, and cannot be a file path.
      • The PDF file should have the same basename (os.path.basename) as the document. E.g., the PDF file should be either "123.pdf" or "123.PDF" for "123.html".
    • @HiromuHota: Changed Parser's signature as follows: (#518)

      • Renamed vizlink to visual_parser.
      • Removed pdf_path. Now this is required only by PdfVisualParser.
      • Removed visual. Provide visual_parser if visual information is to be parsed.
    • @YasushiMiyata: Changed UDFRunner's and UDF's data commit process as follows: (#545)

      • Removed add process on single-thread in _apply in UDFRunner.
      • Added UDFRunner._add of y on multi-threads to Parser, Labeler and Featurizer.
      • Removed y of document parsed result from out_queue in UDF.
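The basename assumption stated above for PdfVisualParser can be illustrated with the standard library. This is a sketch of the naming rule only, not the parser's implementation; `find_pdf` is a hypothetical helper and the temporary directory stands in for the pdf_path directory.

```python
import os
import tempfile
from typing import Optional


def find_pdf(pdf_dir: str, document_name: str) -> Optional[str]:
    """Return the PDF that shares the document's basename, or None if absent."""
    for ext in (".pdf", ".PDF"):
        candidate = os.path.join(pdf_dir, document_name + ext)
        if os.path.isfile(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as pdf_dir:
    # Create an empty placeholder file named "123.PDF" in the PDF directory.
    open(os.path.join(pdf_dir, "123.PDF"), "wb").close()

    # The document name is the basename of the HTML file without its extension.
    doc_name = os.path.splitext(os.path.basename("html/123.html"))[0]

    found = find_pdf(pdf_dir, doc_name)
    found_name = os.path.basename(found) if found else None
    missing = find_pdf(pdf_dir, "456")

print(doc_name, found_name, missing)
```

A document named "123.html" resolves to a PDF in the directory, while a document with no matching PDF yields `None`.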

    Fixed

    Source code(tar.gz)
    Source code(zip)
    fonduer-0.9.0-py3-none-any.whl(146.07 KB)
    fonduer-0.9.0.tar.gz(102.10 KB)
  • v0.8.3(Sep 11, 2020)

    0.8.3 - 2020-09-11

    This is a big release with a lot of changes. These changes are summarized here. Check the Changelog for more details.

    Added

    Changed

    • @YasushiMiyata: Enable RegexMatchSpan to match words concatenated with the sep="(separator)" option. (#270) (#492)
    • @HiromuHota: Enabled "Type hints (PEP 484) support for the Sphinx autodoc extension." (#421)
    • @HiromuHota: Switched the Cython wrapper for Mecab from mecab-python3 to fugashi. Since the Japanese tokenizer remains the same, there should be no impact on users. (#384) (#432)
    • @HiromuHota: Log a stack trace on parsing error for better debug experience. (#478) (#479)
    • @HiromuHota: get_cell_ngrams and get_neighbor_cell_ngrams yield nothing when the mention is not tabular. (#471) (#504)

    Deprecated

    Fixed

    • @senwu: Fix a bug where pdf_path could not be given without a trailing slash. (#442) (#459)
    • @kaikun213: Fix bug in table range difference calculations. (#420)
    • @HiromuHota: mention_extractor.apply with clear=True now works even if it's not the first run. (#424)
    • @HiromuHota: Fix get_horz_ngrams and get_vert_ngrams so that they work even when the input mention is not tabular. (#425) (#426)
    • @HiromuHota: Fix the order of args to Bbox. (#443) (#444)
    • @HiromuHota: Fix the non-deterministic behavior in VisualLinker. (#412) (#458)
    • @HiromuHota: Fix an issue that the progress bar shows no progress on preprocessing by executing preprocessing and parsing in parallel. (#439)
    • @HiromuHota: Adapt to mlflow>=1.9.0. (#461) (#463)
    • @HiromuHota: Correct the entity type for NumberMatcher from "NUMBER" to "CARDINAL". (#473) (#477)
    • @HiromuHota: Fix _get_axis_ngrams not to return None when the input is not tabular. (#481)
    • @HiromuHota: Fix Visualizer.display_candidates not to draw rectangles on wrong pages. (#488)
    • @HiromuHota: Persist doc only when no error happens during parsing. (#489) (#490)
    Source code(tar.gz)
    Source code(zip)
    fonduer-0.8.3-py3-none-any.whl(136.97 KB)
    fonduer-0.8.3.tar.gz(99.00 KB)
  • v0.8.2(Apr 29, 2020)

    0.8.2 - 2020-04-28

    A summary of the changes in this release is below. Check the Changelog for more details.

    Deprecated

    • @HiromuHota: Use of undecorated labeling functions is deprecated and will not be supported as of v0.9.0. Please decorate them with snorkel.labeling.labeling_function.

    Fixed

    • @HiromuHota: Labeling functions can now be decorated with snorkel.labeling.labeling_function. (#400) (#401)
    Source code(tar.gz)
    Source code(zip)
    fonduer-0.8.2-py3-none-any.whl(126.83 KB)
    fonduer-0.8.2.tar.gz(88.07 KB)
  • v0.8.1(Apr 13, 2020)

    0.8.1 - 2020-04-13

    A summary of the changes in this release is below. Check the Changelog for more details.

    Fonduer has a new mode argument to support switching between different learning modes (e.g., STL or MTL).

    Example usage:
    # Create a task for each relation.
    tasks = create_task(
        task_names=TASK_NAMES,
        n_arities=N_ARITIES,
        n_features=N_FEATURES,
        n_classes=N_CLASSES,
        emb_layer=EMB_LAYER,
        model="LogisticRegression",
        mode=MODE,
    )
    

    Added

    • @senwu: Add mode argument in create_task to support STL and MTL.
    Source code(tar.gz)
    Source code(zip)
    fonduer-0.8.1-py3-none-any.whl(128.52 KB)
    fonduer-0.8.1.tar.gz(87.80 KB)
  • v0.8.0(Apr 8, 2020)

    0.8.0 - 2020-04-07

    A summary of the changes in this release is below. Check the Changelog for more details.

    Rather than maintaining a separate learning engine, we switch to Emmental, a deep learning framework for multi-task learning. Switching to a more general learning framework allows Fonduer to support more applications and multi-task learning.

    Example usage:
    # With Emmental, learning takes the following steps:
    # 1. Create a task for each relation and an EmmentalModel to learn those tasks.
    # 2. Wrap candidates into an EmmentalDataLoader for training.
    # 3. Train and run inference (prediction).
    
    # Import paths below follow the Fonduer tutorials; adjust if your versions differ.
    import emmental
    import fonduer
    import numpy as np
    from emmental.data import EmmentalDataLoader
    from emmental.learner import EmmentalLearner
    from emmental.model import EmmentalModel
    from emmental.modules.embedding_module import EmbeddingModule
    from fonduer.learning.dataset import FonduerDataset
    from fonduer.learning.task import create_task
    from fonduer.learning.utils import collect_word_counter
    
    # Collect the word counter from candidates; it is used by the LSTM model.
    word_counter = collect_word_counter(train_cands)
    
    # Initialize Emmental. To customize Emmental, see:
    # https://emmental.readthedocs.io/en/latest/user/config.html
    emmental.init(fonduer.Meta.log_path)
    
    #######################################################################
    # 1. Create a task for each relation and an EmmentalModel to learn those tasks.
    #######################################################################
    
    # Generate special tokens that the LSTM model uses to locate mentions.
    # In the LSTM model, we pad the sentence with special tokens to help the
    # LSTM learn those mentions. Example:
    # Original sentence: Then Barack married Michelle.
    # ->  Then ~~[[1 Barack 1]]~~ married ~~[[2 Michelle 2]]~~.
    arity = 2
    special_tokens = []
    for i in range(arity):
        special_tokens += [f"~~[[{i}", f"{i}]]~~"]
    
    # Generate word embedding module for LSTM.
    emb_layer = EmbeddingModule(
        word_counter=word_counter, word_dim=300, specials=special_tokens
    )
    
    # Create task for each relation.
    tasks = create_task(
        ATTRIBUTE,
        2,
        F_train[0].shape[1],
        2,
        emb_layer,
        mode="mtl",
        model="LogisticRegression",
    )
    
    # Create Emmental model to learn the tasks.
    model = EmmentalModel(name=f"{ATTRIBUTE}_task")
    
    # Add tasks into model
    for task in tasks:
        model.add_task(task)
    
    #######################################################################
    # 2. Wrap candidates into EmmentalDataLoader for training.
    #######################################################################
    
    # Use only the samples that have labels, i.e., filter out the samples
    # whose marginals are not significant.
    diffs = train_marginals.max(axis=1) - train_marginals.min(axis=1)
    train_idxs = np.where(diffs > 1e-6)[0]
    
    # Create a dataloader with weakly supervised samples to learn the model.
    train_dataloader = EmmentalDataLoader(
        task_to_label_dict={ATTRIBUTE: "labels"},
        dataset=FonduerDataset(
            ATTRIBUTE,
            train_cands[0],
            F_train[0],
            emb_layer.word2id,
            train_marginals,
            train_idxs,
        ),
        split="train",
        batch_size=100,
        shuffle=True,
    )
    
    
    # Build the test dataloader used for prediction.
    test_dataloader = EmmentalDataLoader(
        task_to_label_dict={ATTRIBUTE: "labels"},
        dataset=FonduerDataset(
            ATTRIBUTE, test_cands[0], F_test[0], emb_layer.word2id, 2
        ),
        split="test",
        batch_size=100,
        shuffle=False,
    )
    
    
    #######################################################################
    # 3. Training and inference (prediction).
    #######################################################################
    
    # Learn the tasks.
    emmental_learner = EmmentalLearner()
    emmental_learner.learn(model, [train_dataloader])
    
    # Predict with the learned model.
    test_preds = model.predict(test_dataloader, return_preds=True)
    

    Changed

    • @senwu: Switch to Emmental as the default learning engine.
    • @HiromuHota: Change ABSTAIN to -1 to be compatible with Snorkel 0.9.X. Accordingly, user-defined labels should now be 0-indexed (they used to be 1-indexed). (#310) (#320)
    • @HiromuHota: Use executemany_mode="batch" instead of deprecated use_batch_mode=True. (#358)
    • @HiromuHota: Use tqdm.notebook.tqdm instead of deprecated tqdm.tqdm_notebook. (#360)
    • @HiromuHota: To support ImageMagick7, expand the version range of Wand. (#373)
    • @HiromuHota: Comply with PEP 561 for type-checking code that uses Fonduer.
    • @HiromuHota: Make UDF.apply of all child classes unaware of the database backend, meaning PostgreSQL is not required if UDF.apply is directly used instead of UDFRunner.apply. (#316) (#368)
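The ABSTAIN and label-indexing change listed above implies a simple mapping for labeling functions written against earlier versions. The sketch below assumes the earlier convention was ABSTAIN = 0 with class labels starting at 1; that old value and the `migrate_label` helper are assumptions made for illustration, while the new ABSTAIN value of -1 comes from the changelog entry.

```python
from typing import List

OLD_ABSTAIN = 0   # assumed pre-0.8.0 convention (labels were 1-indexed)
NEW_ABSTAIN = -1  # value stated in the changelog entry


def migrate_label(old_label: int) -> int:
    """Map a 1-indexed label (ABSTAIN = 0) to a 0-indexed label (ABSTAIN = -1)."""
    if old_label == OLD_ABSTAIN:
        return NEW_ABSTAIN
    return old_label - 1


old_labels: List[int] = [0, 1, 2, 1]
new_labels = [migrate_label(v) for v in old_labels]
print(new_labels)  # [-1, 0, 1, 0]
```

Under this assumption, every label, including ABSTAIN, simply shifts down by one, so an existing labeling function can be migrated by subtracting 1 from each value it returns.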

    Fixed

    • @senwu: Fix mention extraction to return mention classes instead of data model classes.
Owner
HazyResearch
We are a CS research group led by Prof. Chris Ré.