Free and open-source digital preservation system designed to maintain standards-based, long-term access to collections of digital objects.

Overview

Archivematica

By Artefactual

Archivematica is a web- and standards-based, open-source application which allows your institution to preserve long-term access to trustworthy, authentic and reliable digital content. Our target users are archivists, librarians, and anyone working to preserve digital objects.

You are free to copy, modify, and distribute Archivematica with attribution under the terms of the AGPLv3 license. See the LICENSE file for details.

Installation

Other resources

  • Website: User and administrator documentation
  • Wiki: Developer facing documentation, requirements analysis and community resources
  • Issues: Git repository used for tracking Archivematica issues and feature/enhancement ideas
  • User Google Group: Forum/mailing list for user questions (both technical and end-user)
  • Paid support: Paid support, hosting, training, consulting and software development contracts from Artefactual

Contributing

Thank you for your interest in Archivematica! For more details, see the contributing guidelines.

Reporting an issue

Issues related to Archivematica, the Storage Service, or any related repository can be filed in the Archivematica Issues repository.

Security

If you have a security concern about Archivematica or any related repository, please see the SECURITY file for information about how to safely report vulnerabilities.

Related projects

Archivematica consists of several projects working together, including:

  • Archivematica: This repository! Main repository containing the user-facing dashboard, the task manager (MCPServer), and the client scripts for the MCPClient
  • Storage Service: Responsible for moving files to Archivematica for processing, and from Archivematica into long-term storage
  • Format Policy Registry: Submodule shared between Archivematica and the Format Policy Registry (FPR) server that displays and updates FPR rules and commands

For more projects in the Archivematica ecosystem, see the getting started page.

Comments
  • Problem: using symlinks breaks Windows dev environments

    Archivematica cannot be deployed on Windows, but this PR from @minusdavid https://github.com/artefactual/deploy-pub/pull/39 makes it possible to deploy a development environment on Windows, using Vagrant to deploy to a Linux VM.

    That PR is working great, but there is a problem with checking out a Git repo that contains symlinks onto a Windows filesystem (google it, lots of links). Windows doesn't properly support symlinks, so checking out a repo with symlinks is difficult: Ansible roles choke, you get weird Git errors, etc.

    In this repo, there are only a few symlinks being used - it would not be hard to remove them altogether. I think the only place left is in the osdeps folders. Removing those symlinks and creating duplicate files for now would allow osdeps to differ for each platform, which is fine, and would make developing in a Windows environment much easier, which is a bonus.
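To see which symlinks would need replacing, a checkout can be scanned for them. The following is a minimal sketch, not part of the repository; the helper name `find_symlinks` is hypothetical.

```python
import os


def find_symlinks(root):
    """Return the sorted paths (relative to root) of every symlink under root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # os.walk does not descend into symlinked directories by default,
        # but it still lists them in dirnames, so check both lists.
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                found.append(os.path.relpath(path, root))
    return sorted(found)
```

Running `find_symlinks(".")` from the top of a checkout on Linux lists the candidates to replace with plain copies.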

    opened by jhsimpson 24
  • Problem: Extract contents crashes due to UnicodeEncodeError

    We have come across a transfer where the "Extract contents from compressed archives" job seems to run fine, until it comes across a new compressed object where it fails with the following message in the task overview in the dashboard:

    ....
    Not extracting contents from Cotu_K.doc  - No rule found to extract
    Not extracting contents from UMT_24.02.12.pdf  - No rule found to extract
    Not extracting contents from Rapport_d_activite_Alen.niger_version2.doc  - No rule found to extract
    Not extracting contents from Wawan_10.04.17.pdf  - No rule found to extract
    
    extractContents.py: INFO      2018-05-17 20:56:32,786  archivematica.mcp.client.extractContents:get_dir_uuids:240:  Assigning UUID d425717b-eadf-45fc-b5d7-ab13cf550682 to directory path %transferDirectory%objects/SEMINAIRES_2010/MID_TERM_EVALUATION/revised_ghislaine.zip-2018-05-17T20:54:14.968039+00:00/
    Traceback (most recent call last):
      File "/usr/lib/archivematica/MCPClient/clientScripts/extractContents.py", line 188, in <module>
        sys.exit(main(transfer_uuid, sip_directory, date, task_uuid, delete=delete))
      File "/usr/lib/archivematica/MCPClient/clientScripts/extractContents.py", line 164, in main
        transfer_mdl)
      File "/usr/share/archivematica/dashboard/main/models.py", line 502, in create_many
        for dir_path, dir_uuid in dir_paths_uuids])
      File "/usr/lib/archivematica/archivematicaCommon/archivematicaFunctions.py", line 237, in get_dir_uuids
        dir_uuid, dir_path)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\x82' in position 119: ordinal not in range(128)
    

    The script crashes here, with an exit code of 1 according to the task overview dashboard. In the Transfer dashboard however, the job is incorrectly marked as 'Completed successfully'.

    Furthermore, it seems to have skipped the jobs 'Sanitize extracted objects' file and directory names', 'Scan for viruses on extracted files', etc. that would normally run after extraction of packages. Instead it simply moves forward to the 'Update METS.xml document' as if no packages to extract were found.

    This finally results in a 'real' error during METS creation during ingest:

    Traceback (most recent call last):
      File "/usr/lib/archivematica/MCPClient/clientScripts/archivematicaCreateMETS2.py", line 1314, in <module>
        baseDirectoryPath, objectsDirectoryPath, directories)
      File "/usr/lib/archivematica/MCPClient/clientScripts/archivematicaCreateMETS2.py", line 1182, in get_normative_structmap
        add_normative_structmap_div(all_fsitems, normativeStructMapDiv, directories)
      File "/usr/lib/archivematica/MCPClient/clientScripts/archivematicaCreateMETS2.py", line 1220, in add_normative_structmap_div
        LABEL=basename)
      File "src/lxml/lxml.etree.pyx", line 3112, in lxml.etree.SubElement (src/lxml/lxml.etree.c:81786)
      File "src/lxml/apihelpers.pxi", line 203, in lxml.etree._makeSubElement (src/lxml/lxml.etree.c:18358)
      File "src/lxml/apihelpers.pxi", line 198, in lxml.etree._makeSubElement (src/lxml/lxml.etree.c:18281)
      File "src/lxml/apihelpers.pxi", line 302, in lxml.etree._initNodeAttributes (src/lxml/lxml.etree.c:19840)
      File "src/lxml/apihelpers.pxi", line 316, in lxml.etree._addAttributeToNode (src/lxml/lxml.etree.c:20196)
      File "src/lxml/apihelpers.pxi", line 1439, in lxml.etree._utf8 (src/lxml/lxml.etree.c:32441)
    ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
    

    Further investigation into this error reveals it failed to insert the filename of an extracted file into a normative structmap, due to the fact that the file was of course never normalized after extraction.
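The UnicodeEncodeError in the first traceback is the kind of failure that occurs when text containing a character such as u'\x82' is forced through the ASCII codec. The tracebacks come from Python 2; the sketch below uses Python 3 and a made-up message, so it only illustrates the class of error and is not the Archivematica code. `encode_for_log` is a hypothetical helper.

```python
def encode_for_log(message, encoding="ascii"):
    # Strict encoding raises UnicodeEncodeError when a character
    # cannot be represented in the target encoding.
    return message.encode(encoding)


# Hypothetical path containing the same code point as in the traceback.
message = "Assigning UUID to directory path objects/revis\x82.zip/"

try:
    encode_for_log(message)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent every code point, so the same message encodes cleanly.
utf8_bytes = encode_for_log(message, encoding="utf-8")
```

Encoding explicitly as UTF-8 at the point where the message is written is one common way to avoid this class of crash.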

    IISH 
    opened by kerim1 22
  • MCPClient error check Gearman worker creation

    This patch adds a try/except block to the MCPClient when creating a Gearman worker in startThread().

    Without this patch, if the MCPClient configuration item "MCPArchivematicaServer" has an invalid value, no Gearman worker will be created and Archivematica will be stuck thinking that a job is executing indefinitely with no indication of what happened in the user interface or the logs.

    To test, open "/etc/archivematica/MCPClient/clientConfig.conf", and change "MCPArchivematicaServer" to something invalid like "buffalo" or "localhost::9999", and then try to do a standard transfer in the Archivematica dashboard UI. In the micro-service "Verify transfer compliance", you'll get stuck at "Job: Set file permissions". It will say it's still executing but the job will never actually run.
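The idea of the patch can be sketched as follows. This is an illustration under assumptions, not the patch itself: `parse_server`, `start_worker` and `worker_factory` are hypothetical names, and the address check is a stand-in for whatever the real Gearman library does with a bad value.

```python
import logging

logger = logging.getLogger("mcpclient.sketch")


def parse_server(value):
    """Split 'host:port' into (host, port); raise ValueError if malformed."""
    host, sep, port = value.partition(":")
    if not host or not sep or not port.isdigit():
        raise ValueError("invalid server address: %r" % value)
    return host, int(port)


def start_worker(server_setting, worker_factory):
    """Create a worker, logging the failure instead of failing silently."""
    try:
        host, port = parse_server(server_setting)
        return worker_factory(host, port)
    except Exception:
        # Without this block the error would never reach the logs.
        logger.exception("Could not create worker for %r", server_setting)
        return None
```

With this shape, values such as "buffalo" or "localhost::9999" produce a logged traceback and a `None` result rather than a job that appears to run forever.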

    opened by minusdavid 21
  • Rework the MCP Server, MCP Client and MCP Client scripts to support batching tasks

    • The MCP Server now batches file-level tasks into fixed-size groups, creating one Gearman task per batch, rather than one per file. It also uses fixed-size thread pools to limit contention between threads.

    • The MCP Client now operates in batches, processing one batch at a time. It also supports running tasks using a pool of processes, which improves throughput where tasks benefit from spanning multiple CPUs.

    • The MCP Client scripts now accept a batch of jobs and process them as a single unit. There is a new Job API that provides a standard interface for these client scripts, and all scripts have been converted to use this.
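Grouping file-level tasks into fixed-size batches, as described in the list above, can be sketched like this. The function name `batch` and the batch size of 2 are chosen for illustration only and are not taken from the PR.

```python
def batch(items, size):
    """Yield consecutive slices of items, each holding at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Five file-level tasks become three batches instead of five separate tasks.
tasks = ["file-%d" % i for i in range(5)]
batches = list(batch(tasks, 2))
```

Each batch would then be submitted as one unit of work, which cuts the per-task overhead when a transfer contains many files.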

    The motivation for this work was to improve performance on transfer and ingest workflows, and to provide an improved interface for implementing client scripts.

    Our testing shows transfers and ingests taking approximately half the time they did without these changes.

    These changes also permit further optimisation of client scripts, by taking advantage of processing files in batches rather than one at a time. We did some work on optimising a few of the client scripts, but there is likely more improvement to be gained by further optimisation.


    This is connected to #938.

    Jisc RDSS 
    opened by jambun 19
  • Problem: Consistent Ingest failure with media/video transfer

    This package of data here is causing Ingest to fail in Archivematica 1.7:

    To recreate:

    • Untar the data and begin transfer
    • Transfer will complete but will hang on Normalize -> Validate Preservation Derivatives job

    If we look at the MCP Server Log we see a large chunk of MediaConch output, followed by a SQL failure:

    2[1]\\" formatid=\\"0xBF\\">0x434FD850</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"300\\" context=\\"/Segment[1]/Info[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0xBAB0729C</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"387\\" context=\\"/Segment[1]/Tracks[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0xAD6CDF0C</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"657\\" context=\\"/Segment[1]/Tags[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0x634624A1</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"336713942\\" context=\\"/Segment[1]/Cues[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0x96E4D111</value>\\n        </test>\\n      </check>\\n      <check icid=\\"EBML-CRC-VALID\\" version=\\"1\\" tests_run=\\"5\\" fail_count=\\"0\\" pass_count=\\"5\\">\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"65\\" context=\\"/Segment[1]/SeekHead[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0x434FD850</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"300\\" context=\\"/Segment[1]/Info[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0xBAB0729C</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"387\\" context=\\"/Segment[1]/Tracks[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0xAD6CDF0C</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"657\\" context=\\"/Segment[1]/Tags[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0x634624A1</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"CRC-32\\" offset=\\"336713942\\" context=\\"/Segment[1]/Cues[1]/CRC-32[1]\\" formatid=\\"0xBF\\">0x96E4D111</value>\\n        </test>\\n      </check>\\n      <check 
icid=\\"MKV-VALID-TRACKTYPE-VALUE\\" version=\\"1\\" tests_run=\\"2\\" fail_count=\\"0\\" pass_count=\\"2\\">\\n        <context name=\\"Valid Values\\">1 2 3 16 17 18 32</context>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"TrackType\\" offset=\\"419\\" context=\\"/Segment[1]/Tracks[1]/TrackEntry[1]/TrackType[1]\\" formatid=\\"0x83\\">1</value>\\n          <value offset=\\"419\\" name=\\"TrackType\\">1</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"TrackType\\" offset=\\"616\\" context=\\"/Segment[1]/Tracks[1]/TrackEntry[2]/TrackType[1]\\" formatid=\\"0x83\\">2</value>\\n          <value offset=\\"616\\" name=\\"TrackType\\">2</value>\\n        </test>\\n      </check>\\n      <check icid=\\"MKV-VALID-BOOLEANS\\" version=\\"1\\" tests_run=\\"2\\" fail_count=\\"0\\" pass_count=\\"2\\">\\n        <context name=\\"Valid Values\\">0 1</context>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"FlagLacing\\" offset=\\"409\\" context=\\"/Segment[1]/Tracks[1]/TrackEntry[1]/FlagLacing[1]\\" formatid=\\"0x9C\\">0</value>\\n          <value offset=\\"409\\" name=\\"FlagLacing\\">0</value>\\n        </test>\\n        <test outcome=\\"pass\\">\\n          <value name=\\"FlagLacing\\" offset=\\"591\\" context=\\"/Segment[1]/Tracks[1]/TrackEntry[2]/FlagLacing[1]\\" formatid=\\"0x9C\\">0</value>\\n          <value offset=\\"591\\" name=\\"FlagLacing\\">0</value>\\n        </test>\\n      </check>\\n    </implementationChecks>\\n    <implementationChecks checks_run=\\"0\\" fail_count=\\"0\\" pass_count=\\"0\\">\\n      <name>MediaConch FFV1 Implementation Checker</name>\\n    </implementationChecks>\\n    <implementationChecks checks_run=\\"1\\" fail_count=\\"0\\" pass_count=\\"1\\">\\n      <name>MediaConch PCM Implementation Checker</name>\\n      <check icid=\\"PCM-IS-CBR\\" version=\\"1\\" tests_run=\\"1\\" fail_count=\\"0\\" pass_count=\\"1\\">\\n        <context name=\\"Valid 
Values\\">CBR</context>\\n        <test outcome=\\"pass\\">\\n          <value offset=\\"\\" name=\\"\\">CBR</value>\\n        </test>\\n      </check>\\n    </implementationChecks>\\n  </media>\\n</MediaConch>\\n\\r\\n\\n", "eventOutcomeDetailNote": "MediaConch implementation check result: The implementation check MediaConch EBML Implementation Checker returned failure for the following check(s): EBML-ELEMENT-VALID-PARENT."}\n\n'}
    archivematica-mcp-server_1       | ERROR     2018-03-09 03:06:01  archivematica.mcp.server:utils:wrapped:16:  Uncaught exception
    archivematica-mcp-server_1       | Traceback (most recent call last):
    archivematica-mcp-server_1       |   File "/src/MCPServer/lib/utils.py", line 14, in wrapped
    archivematica-mcp-server_1       |     return fn(*args, **kwargs)
    archivematica-mcp-server_1       |   File "/src/archivematicaCommon/lib/databaseFunctions.py", line 47, in wrapper
    archivematica-mcp-server_1       |     return f(*args, **kwargs)
    archivematica-mcp-server_1       |   File "/src/MCPServer/lib/taskStandard.py", line 91, in performTask
    archivematica-mcp-server_1       |     self.check_request_status(completed_job_request)
    archivematica-mcp-server_1       |   File "/src/MCPServer/lib/taskStandard.py", line 100, in check_request_status
    archivematica-mcp-server_1       |     self.linkTaskManager.taskCompletedCallBackFunction(self)
    archivematica-mcp-server_1       |   File "/src/MCPServer/lib/linkTaskManagerFiles.py", line 143, in taskCompletedCallBackFunction
    archivematica-mcp-server_1       |     databaseFunctions.logTaskCompletedSQL(task)
    archivematica-mcp-server_1       |   File "/src/archivematicaCommon/lib/databaseFunctions.py", line 263, in logTaskCompletedSQL
    archivematica-mcp-server_1       |     task.save()
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 734, in save
    archivematica-mcp-server_1       |     force_update=force_update, update_fields=update_fields)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 762, in save_base
    archivematica-mcp-server_1       |     updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 827, in _save_table
    archivematica-mcp-server_1       |     forced_update)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 877, in _do_update
    archivematica-mcp-server_1       |     return filtered._update(values) > 0
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/query.py", line 580, in _update
    archivematica-mcp-server_1       |     return query.get_compiler(self.db).execute_sql(CURSOR)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 1062, in execute_sql
    archivematica-mcp-server_1       |     cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
    archivematica-mcp-server_1       |     cursor.execute(sql, params)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    archivematica-mcp-server_1       |     return self.cursor.execute(sql, params)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/utils.py", line 98, in __exit__
    archivematica-mcp-server_1       |     six.reraise(dj_exc_type, dj_exc_value, traceback)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    archivematica-mcp-server_1       |     return self.cursor.execute(sql, params)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 124, in execute
    archivematica-mcp-server_1       |     return self.cursor.execute(query, args)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/MySQLdb/cursors.py", line 226, in execute
    archivematica-mcp-server_1       |     self.errorhandler(self, exc, value)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    archivematica-mcp-server_1       |     raise errorvalue
    archivematica-mcp-server_1       | OperationalError: (2006, 'MySQL server has gone away')
    archivematica-mcp-server_1       | Exception in thread Thread-1105:
    archivematica-mcp-server_1       | Traceback (most recent call last):
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    archivematica-mcp-server_1       |     self.run()
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/threading.py", line 754, in run
    archivematica-mcp-server_1       |     self.__target(*self.__args, **self.__kwargs)
    archivematica-mcp-server_1       |   File "/src/MCPServer/lib/utils.py", line 14, in wrapped
    archivematica-mcp-server_1       |     return fn(*args, **kwargs)
    archivematica-mcp-server_1       |   File "/src/archivematicaCommon/lib/databaseFunctions.py", line 47, in wrapper
    archivematica-mcp-server_1       |     return f(*args, **kwargs)
    archivematica-mcp-server_1       |   File "/src/MCPServer/lib/taskStandard.py", line 91, in performTask
    archivematica-mcp-server_1       |     self.check_request_status(completed_job_request)
    archivematica-mcp-server_1       |   File "/src/MCPServer/lib/taskStandard.py", line 100, in check_request_status
    archivematica-mcp-server_1       |     self.linkTaskManager.taskCompletedCallBackFunction(self)
    archivematica-mcp-server_1       |   File "/src/MCPServer/lib/linkTaskManagerFiles.py", line 143, in taskCompletedCallBackFunction
    archivematica-mcp-server_1       |     databaseFunctions.logTaskCompletedSQL(task)
    archivematica-mcp-server_1       |   File "/src/archivematicaCommon/lib/databaseFunctions.py", line 263, in logTaskCompletedSQL
    archivematica-mcp-server_1       |     task.save()
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 734, in save
    archivematica-mcp-server_1       |     force_update=force_update, update_fields=update_fields)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 762, in save_base
    archivematica-mcp-server_1       |     updated = self._save_table(raw, cls, force_insert, force_update, using, update_fields)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 827, in _save_table
    archivematica-mcp-server_1       |     forced_update)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/base.py", line 877, in _do_update
    archivematica-mcp-server_1       |     return filtered._update(values) > 0
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/query.py", line 580, in _update
    archivematica-mcp-server_1       |     return query.get_compiler(self.db).execute_sql(CURSOR)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 1062, in execute_sql
    archivematica-mcp-server_1       |     cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
    archivematica-mcp-server_1       |     cursor.execute(sql, params)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    archivematica-mcp-server_1       |     return self.cursor.execute(sql, params)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/utils.py", line 98, in __exit__
    archivematica-mcp-server_1       |     six.reraise(dj_exc_type, dj_exc_value, traceback)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/utils.py", line 64, in execute
    archivematica-mcp-server_1       |     return self.cursor.execute(sql, params)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/django/db/backends/mysql/base.py", line 124, in execute
    archivematica-mcp-server_1       |     return self.cursor.execute(query, args)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/MySQLdb/cursors.py", line 226, in execute
    archivematica-mcp-server_1       |     self.errorhandler(self, exc, value)
    archivematica-mcp-server_1       |   File "/usr/local/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    archivematica-mcp-server_1       |     raise errorvalue
    archivematica-mcp-server_1       | OperationalError: (2006, 'MySQL server has gone away')
    

    This has been seen with other transfer material, but it wasn't measured or recreated under the same controlled circumstances as here.

    Will investigate more as I get an opportunity.

    Type: bug 
    opened by ross-spencer 19
  • Issues with installing Archivematica 1.8 RPMs/Debs on Fresh Servers

    Using the documentation located here, I encountered the following issues with the CentOS and Ubuntu packages posted to last meeting's agenda:

    CentOS 7

    At step 3 of the instructions in the documentation, I got the following error:

    [[email protected] ~]$ sudo -u root yum install -y java-1.8.0-openjdk-headless mariadb-server gearmand
    Loaded plugins: fastestmirror
    https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
    Trying other mirror.
    https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
    Trying other mirror.
    https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
    Trying other mirror.
    https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
    Trying other mirror.
    https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
    Trying other mirror.
    https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
    Trying other mirror.
    https://jenkins-ci.archivematica.org/1.8.x/centos/repodata/repomd.xml: [Errno 14] curl#7 - "Failed connect to jenkins-ci.archivematica.org:443; Connection refused"
    

    This was just due to a difference in the URL in the documentation vs the URL from last meeting. I updated the yum repo to use the correct URL and received the following:

    [[email protected] ~]$ sudo -u root yum install -y python-pip archivematica-storage-service
    Loaded plugins: fastestmirror
    Loading mirror speeds from cached hostfile
     * base: mirrors.usc.edu
     * epel: mirrors.kernel.org
     * extras: mirror.web-ster.com
     * updates: mirror.keystealth.org
    No package archivematica-storage-service available.
    

    So I was unable to proceed past step 3 in the instructions since the storage service wasn't available. I did try the steps afterwards just to see how far I could get without the storage service and got up to the following (step 5):

    [[email protected] ~]$ sudo -u archivematica bash -c " \
    > set -a -e -x
    > source /etc/sysconfig/archivematica-dashboard
    > cd /usr/share/archivematica/dashboard
    > /usr/lib/python2.7/archivematica/dashboard/bin/python manage.py syncdb --noinput
    > ";
    + source /etc/sysconfig/archivematica-dashboard
    ++ ARCHIVEMATICA_DASHBOARD_DASHBOARD_DJANGO_SECRET_KEY=Ptpucrhu0doIq2QcHZtcO9caaqE11fk2
    ++ ARCHIVEMATICA_DASHBOARD_DASHBOARD_DJANGO_ALLOWED_HOSTS='*'
    ++ AM_GUNICORN_BIND=127.0.0.1:7400
    ++ DJANGO_SETTINGS_MODULE=settings.production
    ++ ARCHIVEMATICA_DASHBOARD_DB_NAME=MCP
    ++ ARCHIVEMATICA_DASHBOARD_DB_USER=archivematica
    ++ ARCHIVEMATICA_DASHBOARD_DB_PASSWORD=demo
    ++ ARCHIVEMATICA_DASHBOARD_DB_HOST=localhost
    ++ ARCHIVEMATICA_DASHBOARD_DB_PORT=3306
    ++ ARCHIVEMATICA_DASHBOARD_GEARMAN=localhost:4730
    ++ ARCHIVEMATICA_DASHBOARD_ELASTICSEARCH=localhost:9200
    ++ PYTHONPATH=/usr/lib/archivematica/archivematicaCommon/:/usr/share/archivematica/dashboard
    + cd /usr/share/archivematica/dashboard
    + /usr/lib/python2.7/archivematica/dashboard/bin/python manage.py syncdb --noinput
    bash: line 3: /usr/lib/python2.7/archivematica/dashboard/bin/python: No such file or directory
    

    Ubuntu 16.04

    I got up to step 3 on Ubuntu. When I ran apt-get update I received the following:

    Reading package lists... Done
    W: The repository 'http://jenkins-ci.archivematica.org/1.8.x/ubuntu xenial Release' does not have a Release file.
    N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
    N: See apt-secure(8) manpage for repository creation and user configuration details.
    W: http://packages.archivematica.org/1.6.x/ubuntu-externals/dists/trusty/InRelease: Signature by key 486650CDD6355E25DA542E06C8F04D025236CA08 uses weak digest algorithm (SHA1)
    E: Failed to fetch http://jenkins-ci.archivematica.org/1.8.x/ubuntu/dists/xenial/main/binary-amd64/Packages  404  Not Found
    E: Some index files failed to download. They have been ignored, or old ones used instead.
    

    Again there was a slight gap between the URLs from our last meeting and the documentation. When I switched the URL to use what was in our meeting notes I was met with:

    Reading package lists... Done
    W: The repository 'http://jenkins-ci.archivematica.org/repos/apt/dev-1.8.x-xenial xenial Release' does not have a Release file.
    N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
    N: See apt-secure(8) manpage for repository creation and user configuration details.
    W: http://packages.archivematica.org/1.6.x/ubuntu-externals/dists/trusty/InRelease: Signature by key 486650CDD6355E25DA542E06C8F04D025236CA08 uses weak digest algorithm (SHA1)
    E: Failed to fetch http://jenkins-ci.archivematica.org/repos/apt/dev-1.8.x-xenial/dists/xenial/main/binary-amd64/Packages  404  Not Found
    E: Some index files failed to download. They have been ignored, or old ones used instead.
    

    This looks like it might just be due to how the repo is structured, since the current release has dist and architecture-specific subdirs.

    Status: in progress Columbia University Library CUL: phase 1 
    opened by jpellman 18
  • Problem: ES client timeout is not configurable

    With big METS files, the 10-second timeout might not be enough. This adds a configuration parameter so the timeout can be set.

    I went for the conservative approach of only changing the ES aip index call, but this can also be handled at connection level, with something like:

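    from elasticsearch import Elasticsearch

    # _es_hosts and request_timeout are assumed to be defined elsewhere.
    es_client = Elasticsearch(
        hosts=_es_hosts,
        timeout=request_timeout,  # applies to every request on this client
        dead_timeout=2,
    )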
    

    Refs: #10734

    opened by scollazo 18
  • Problem: Parse Dataverse Mets fails for some datasets

    Testing the new "Parse Dataverse Mets" job within the "Parse External Files" microservice.

    When testing with datasets I added to Dataverse, this job completes successfully.

    When testing datasets created by the Scholar's Portal team, this job fails. (Screenshot, 2018-06-20 10:07:14 am, not preserved.)

    The error message from the task that doesn't complete is: (screenshot, 2018-06-20 10:07:41 am, not preserved.)

    I am not sure if it is relevant, but to access this dataset, I used "data&subtree=archivematica" in the relative path of the Dataverse location. For the datasets that have passed this job, that field is only set to "archivematica".

    OCUL: AM-Dataverse 
    opened by joel-simpson 17
  • Problem: antivirus scanning errors if file is too big

    Some defaults in clamd.conf are causing the antivirus scanning client script to fail. I believe these are the attributes involved:

    ■ MaxScanSize SIZE
    Sets the maximum amount of data to be scanned for each input file. Archives and other containers are recursively extracted and scanned up to this value. Warning: disabling this limit or setting it too high may result in severe damage to the system.
    Default: 100M
    
    ■ MaxFileSize SIZE
    Files larger than this limit won't be scanned. Affects the input file itself as well as files contained inside it (when the input file is an archive, a document or some other kind of container). Warning: disabling this limit or setting it too high may result in severe damage to the system.
    Default: 25M
    
    ■ StreamMaxLength SIZE
    Clamd uses FTP-like protocol to receive data from remote clients. If you are using clamav-milter to balance load between remote clamd daemons on firewall servers you may need to tune the Stream* options. This option allows you to specify the upper limit for data size that will be transfered to remote daemon when scanning a single file. It should match your MTA's limit for a maximum attachment size.
    Default: 10M
    

    I tried to change their values to:

    StreamMaxLength 0
    MaxScanSize 0
    MaxFileSize 0
    

    Zero means unlimited. However, the value of StreamMaxLength seems to be hard-coded to 4G (read this).

    With this configuration, I ran the following tests:

    [image]

    The largest file (4.1G) failed.

    The output of the client script is:

    archivematicaClamscan.py: ERROR     2017-10-11 22:05:26,861  archivematica.mcp.client.clamscan:main:95:  Unexpected error scanning: /var/archivematica/sharedDirectory/currentlyProcessing/Test-4.1G-File-With-Accents_2-31a95a67-a0e0-4546-815d-fdbeebd41da6/objects/Volcán.jpg
    Traceback (most recent call last):
      File "/usr/lib/archivematica/MCPClient/clientScripts/archivematicaClamscan.py", line 93, in main
        result = client.instream(open(target))
      File "/usr/share/python/archivematica-mcp-client/local/lib/python2.7/site-packages/clamd/__init__.py", line 190, in instream
        self.clamd_socket.send(size + chunk)
    error: [Errno 104] Connection reset by peer
    archivematicaClamscan.py: INFO      2017-10-11 22:05:26,867  archivematica.mcp.client.clamscan:record_event:56:  Recording event for fileUUID=29ffd407-35c6-41bf-ae5a-30dee0589962 outcome=Fail
    

    In /var/log/clamav/clamav.log:

    WARNING: INSTREAM: Size limit reached, (requested: 1024, max: 1023)
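
One way to avoid the connection reset is to check the file size before streaming. The following is a minimal sketch, not the client script's actual logic: `STREAM_LIMIT_BYTES`, `can_stream` and `scan_or_skip` are hypothetical names, the 4G ceiling is taken from the observation above, and `scan` stands in for whatever callable performs the real scan.

```python
import os

# Assumption: clamd refuses streams of 4G or more, as observed above.
STREAM_LIMIT_BYTES = 4 * 1024 ** 3


def can_stream(path, limit=STREAM_LIMIT_BYTES):
    """Return True if the file is strictly smaller than the stream limit."""
    return os.path.getsize(path) < limit


def scan_or_skip(path, scan, limit=STREAM_LIMIT_BYTES):
    """Call scan(path) only when the file fits under the limit.

    `scan` is a placeholder for the real scanning call; it is not
    part of any library.
    """
    if not can_stream(path, limit):
        return ("skipped", "file exceeds stream limit")
    return ("scanned", scan(path))
```

With a check like this, an oversized file can be recorded as skipped instead of surfacing as an unexpected socket error.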
    
    Type: bug Severity: critical 
    opened by sevein 17
  • Problem: index_aip crashes elasticsearch for large transfers

    While testing large and multiple transfers for the rate limiting investigation, we noticed elasticsearch crash when it hit max-memory. Transfers with many files produced larger JSON documents (50 to 100MB), and the post to elasticsearch would take longer than the 10 second timeout causing a retry soon after. As these retries pile up, elasticsearch quickly hits its memory limit and barfs.

    We tried increasing the elasticsearch memory allocation to 3x the default and still hit the limit. However, we think we can avoid this situation by increasing the default timeout from 10 seconds to 5 minutes during AIP indexing. This will lessen the load on the elasticsearch server (by avoiding all those retries) and allow time for those large documents to be indexed.

    We'll test this out and prepare a PR.
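
The reasoning behind the longer timeout can be illustrated with a toy model. This is a sketch under stated assumptions, not a measurement: `posts_received` and `worst_case_bytes` are hypothetical helpers, `max_retries=3` is an assumed client setting, and the model assumes every attempt takes the same time on the server.

```python
def posts_received(index_seconds, timeout_seconds, max_retries=3):
    """Count the POSTs the server receives for one document.

    Toy model: if the server needs longer than the client timeout,
    the client gives up and resends until max_retries is exhausted.
    """
    if index_seconds <= timeout_seconds:
        return 1
    return 1 + max_retries


def worst_case_bytes(doc_bytes, index_seconds, timeout_seconds, max_retries=3):
    """Upper bound on document bytes in flight if all attempts overlap."""
    return doc_bytes * posts_received(index_seconds, timeout_seconds, max_retries)
```

In this model, a 100MB document that takes 60 seconds to index is sent four times with a 10 second timeout but only once with a 5 minute timeout.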

    Jisc RDSS Piql NHA 
    opened by payten 16
  • Problem: quarantine delay is not reliable

    When a transfer is sent to quarantine, the user is prompted to remove it from quarantine manually. By default, the transfer is also removed from quarantine automatically after 28 days. This delay can be configured by the user in the processing configuration.

    The purpose of the delay is to allow virus definitions to update, before virus scan.

    There are two modules in MCP implementing the processing delay (1, 2).

    It's done with a timer from the threading module, which doesn't seem to provide real guarantees. What would happen if the process is interrupted before the timer finishes? AMQP or Redis seem to offer primitives that allow implementing scheduling. Gearman doesn't seem to provide any.

    Solution 1

    Deprecate quarantine delay functionality. Only the user would be able to remove it from quarantine.

    Solution 2

    Update virus definitions before antivirus checking?

    Solution 3

    Implement delayed jobs using Redis or similar. Once a new scheduled job comes in, MCP would persist it somewhere. In a loop, MCP would poll the scheduled tasks frequently and dispatch new jobs when they are due. The following is a library that could be used for this purpose or as a reference: https://github.com/josiahcarlson/rpqueue/ (it uses Python + Redis).

    Remember that Redis already works as a Gearman backend; we've tried this successfully in our local Docker developer setup. Adding Redis to our stack would also be beneficial for other purposes like caching in Django.
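
The persist-then-poll idea in Solution 3 can be sketched as follows. To keep the example self-contained it uses SQLite from the standard library in place of Redis; the table name and the functions `open_store`, `schedule` and `pop_due` are hypothetical and not part of MCP.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the store that survives process restarts when
    `path` points at a file."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS delayed_jobs ("
        "job_id TEXT PRIMARY KEY, run_at REAL NOT NULL)"
    )
    conn.commit()
    return conn


def schedule(conn, job_id, delay_seconds, now=None):
    """Persist a job together with the time at which it becomes due."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO delayed_jobs VALUES (?, ?)",
        (job_id, now + delay_seconds),
    )
    conn.commit()


def pop_due(conn, now=None):
    """Return and delete every job whose due time has passed."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT job_id FROM delayed_jobs WHERE run_at <= ? ORDER BY run_at",
        (now,),
    ).fetchall()
    due = [row[0] for row in rows]
    conn.executemany(
        "DELETE FROM delayed_jobs WHERE job_id = ?", [(j,) for j in due]
    )
    conn.commit()
    return due
```

A polling loop would call `pop_due` at a fixed interval. Because the due time is stored rather than held in a thread timer, an interrupted process can pick up where it left off after a restart.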

    Severity: high 
    opened by sevein 16
  • Truncates filenames if they exceed os limit

    Gets the maximum filename length and truncates the name of the renamed file if it exceeds the maximum allowed by the underlying OS. This is one path to addressing Archivematica/Issues#1586
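
A minimal sketch of this kind of truncation, not the code in this pull request: `max_filename_bytes` and `truncate_filename` are hypothetical names, the limit is treated as a UTF-8 byte count, and the 255 fallback is an assumption for platforms where `os.pathconf` is unavailable.

```python
import os


def max_filename_bytes(directory="."):
    """Ask the OS for the filename limit of the filesystem holding `directory`."""
    try:
        return os.pathconf(directory, "PC_NAME_MAX")
    except (AttributeError, ValueError, OSError):
        return 255  # assumed fallback


def truncate_filename(name, limit):
    """Shorten `name` to at most `limit` UTF-8 bytes, keeping the extension
    when it fits."""
    if len(name.encode("utf-8")) <= limit:
        return name
    stem, ext = os.path.splitext(name)
    budget = limit - len(ext.encode("utf-8"))
    if budget <= 0:
        # The extension alone is too long, so cut the whole name.
        return name.encode("utf-8")[:limit].decode("utf-8", errors="ignore")
    # errors="ignore" drops a multi-byte character split by the cut.
    stem_cut = stem.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return stem_cut + ext
```

Counting bytes rather than characters matters for names with accents, such as the "Volcán.jpg" example elsewhere on this page.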

    opened by helrond 0
  • Prefer QuerySet.exists() to QuerySet.count() > 0

    This is a micro-optimisation I found while looking at some database issues in the clamscan script; the Django documentation suggests using this instead of count() > 0.

    See https://docs.djangoproject.com/en/4.1/ref/models/querysets/#django.db.models.query.QuerySet.exists

    This is for https://github.com/archivematica/Issues/issues/1578 and https://github.com/wellcomecollection/archivematica-infrastructure/issues/101
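
The difference can be illustrated without Django by comparing the two query shapes directly. This is a sketch using SQLite from the standard library; the `events` table and both helper functions are hypothetical, and the SQL is only an approximation of what the ORM issues for `count()` and `exists()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (file_uuid TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("f1", "virus check"), ("f1", "ingestion"), ("f2", "ingestion")],
)


def has_event_count(conn, file_uuid, event_type):
    """count() > 0 style: the database counts every matching row."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM events WHERE file_uuid = ? AND event_type = ?",
        (file_uuid, event_type),
    ).fetchone()
    return n > 0


def has_event_exists(conn, file_uuid, event_type):
    """exists() style: the database can stop at the first matching row."""
    row = conn.execute(
        "SELECT 1 FROM events WHERE file_uuid = ? AND event_type = ? LIMIT 1",
        (file_uuid, event_type),
    ).fetchone()
    return row is not None
```

Both functions return the same answer; the second form simply gives the database less work when many rows match.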

    opened by alexwlchan 0
  • Micro-optimisations from the Wellcome fork

    Part of https://github.com/archivematica/Issues/issues/1578

    I've been reviewing the changes in our fork of artefactual/archivematica. After removing changes that are already merged (OIDC/zipped bag) or changes which won't be merged (Wellcome-specific bits), this is what was left. It's not especially substantial, but seemed a shame to let it go to waste.

    opened by alexwlchan 0
  • Make ES limits configurable

    The total number of AIPs shown on the "Archival Storage" and "Appraisal" tabs was limited to 10000 items. The limit was hardcoded, so this pull request makes it configurable.

    Connects to https://github.com/archivematica/Issues/issues/1571
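
A configurable limit with a safe default might look like the sketch below. The environment variable name `EXAMPLE_ES_MAX_RESULTS` and the function `search_result_limit` are hypothetical and are not the names used in this pull request; only the 10000 default comes from the description above.

```python
import os

DEFAULT_LIMIT = 10000  # the previously hardcoded value


def search_result_limit(environ=os.environ):
    """Read the limit from the environment, falling back to the default
    when the value is missing, not a number, or not positive."""
    raw = environ.get("EXAMPLE_ES_MAX_RESULTS", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_LIMIT
    return value if value > 0 else DEFAULT_LIMIT
```

Falling back to the old value keeps existing deployments unchanged unless they opt in.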

    opened by mamedin 1
  • Add ability to customize LDAP Attributes

    This allows users to override the attributes for first name, last name, and email.

    I tested this locally and it works. It would be great not to have to overwrite core files on the system.

    Thanks!

    Related to https://github.com/archivematica/Issues/issues/1565
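
As an illustration of the idea, the sketch below builds an attribute map with per-field overrides. It assumes the result would feed django-auth-ldap's `AUTH_LDAP_USER_ATTR_MAP` setting; the `EXAMPLE_LDAP_ATTR_*` variable names, the function `ldap_user_attr_map`, and the default attribute names are assumptions, not the names used in this pull request.

```python
import os

# Assumed defaults: common LDAP attribute names for each Django user field.
DEFAULTS = {"first_name": "givenName", "last_name": "sn", "email": "mail"}


def ldap_user_attr_map(environ=os.environ):
    """Return a Django-field to LDAP-attribute map, letting each entry be
    overridden by a hypothetical EXAMPLE_LDAP_ATTR_<FIELD> variable."""
    return {
        field: environ.get("EXAMPLE_LDAP_ATTR_" + field.upper(), default)
        for field, default in DEFAULTS.items()
    }
```

A settings module could then assign the returned dictionary to `AUTH_LDAP_USER_ATTR_MAP`, so a site with a non-standard directory schema only sets the variables it needs.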

    opened by misilot 0
Releases(v1.13.2)