# NAME

Archive::BagIt - The main module to handle bags.

# VERSION

version 0.086

# SOURCE

The original development version was on github at [http://github.com/rjeschmi/Archive-BagIt](http://github.com/rjeschmi/Archive-BagIt)
and may be cloned from there.

The current development version is available at [https://git.fsfe.org/art1pirat/Archive-BagIt](https://git.fsfe.org/art1pirat/Archive-BagIt)

# Conformance to RFC8493

The module should fulfill the RFC requirements, with the following limitations:

- only UTF-8 encoding is supported
- only version 0.97 or 1.0 is allowed
- version 0.97 requires tagmanifest and manifest files with md5 fixity
- version 1.0 requires tagmanifest and manifest files with sha512 fixity
- a byte order mark (BOM) is not supported
- carriage returns in bagit files are not allowed
- fetch.txt is unsupported

At the moment, only Linux-style file paths are supported.

For a more detailed overview, see the test suite in `t/verify_bag.t` and the corresponding test bags from the Library of Congress BagIt conformance suite under `bagit_conformance_suite/`.

See [https://datatracker.ietf.org/doc/rfc8493/?include\_text=1](https://datatracker.ietf.org/doc/rfc8493/?include_text=1) for details.
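Some of the limitations above can be checked on the raw content of a `bagit.txt` before a bag is handed to the module. The following is only a minimal sketch under stated assumptions: `check_bagit_declaration` is a hypothetical helper, not part of Archive::BagIt, and it does not reproduce the module's own checks.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Archive::BagIt: inspects the raw content
# of a bagit.txt and returns a list of problems (empty list if none found).
sub check_bagit_declaration {
    my ($raw) = @_;
    my @problems;
    push @problems, 'byte order mark found' if $raw =~ /\A\xEF\xBB\xBF/;
    push @problems, 'carriage return found' if $raw =~ /\r/;
    my ($version)  = $raw =~ /^BagIt-Version: (\d+\.\d+)$/m;
    my ($encoding) = $raw =~ /^Tag-File-Character-Encoding: (\S+)$/m;
    push @problems, 'unsupported or missing version'
        unless defined $version && ( $version eq '0.97' || $version eq '1.0' );
    push @problems, 'unsupported or missing encoding'
        unless defined $encoding && uc($encoding) eq 'UTF-8';
    return @problems;
}

my @ok  = check_bagit_declaration("BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n");
my @bad = check_bagit_declaration("BagIt-Version: 0.96\r\nTag-File-Character-Encoding: UTF-8\r\n");
print "valid declaration: ", scalar(@ok), " problem(s)\n";
print "broken declaration: ", join( '; ', @bad ), "\n";
```

Returning a list of problems instead of dying on the first one makes it easy to report everything that is wrong with a declaration at once.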

# TODO

- enhanced testsuite
- reduce complexity
- use modern perl code
- add flag to enable very strict verify

# FAQ

## How to access the manifest-entries directly?

Try this:

    foreach my $algorithm ( keys %{ $self->manifests }) {
        my $entries_ref = $self->manifests->{$algorithm}->manifest_entries();
        # $entries_ref is a hashref like:
        # {
        #     "data/hello.txt" => "e7c22b994c59d9cf2b48e549b1e24666636045930d3da7c1acb299d1c3b7f931f94aae41edda2c2b207a36e10f8bcb8d45223e54878f5b316e7ce3b6bc019629"
        # }
    }

The same approach works for tagmanifests.
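To make the shape of such a hashref concrete, the sketch below builds an equivalent structure by hand from manifest lines of the form `<digest> <path>` and walks it. `parse_manifest_lines` is a hypothetical helper for illustration only. It is not the module's parser, and the digests are shortened placeholders.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's parser: turns manifest lines of the
# form "<digest> <path>" into a hashref keyed by file path.
sub parse_manifest_lines {
    my (@lines) = @_;
    my %entries;
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*$/;    # skip empty lines
        my ( $digest, $path ) = $line =~ /^(\S+)\s+(.+)$/
            or die "malformed manifest line: $line";
        $entries{$path} = $digest;
    }
    return \%entries;
}

# Placeholder digests, shortened for readability.
my $entries_ref = parse_manifest_lines(
    "e7c22b99  data/hello.txt\n",
    "0a1b2c3d  data/sub/world.txt\n",
);

for my $path ( sort keys %{$entries_ref} ) {
    print "$path => $entries_ref->{$path}\n";
}
```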

## How fast is [Archive::BagIt](https://metacpan.org/pod/Archive%3A%3ABagIt)?

I have made great efforts to optimize Archive::BagIt for high throughput. There are two limiting factors:

- Calculation of checksums: switching from the "Digest" module to OpenSSL via [Net::SSLeay](https://metacpan.org/pod/Net%3A%3ASSLeay) gave a significant
   speed increase.
- Loading the files referenced in the manifest files: this was previously done serially with synchronous I/O. With
   the [IO::Async](https://metacpan.org/pod/IO%3A%3AAsync) module, the files are loaded asynchronously and the checksums are calculated in parallel.
   If the underlying file system supports parallel access, the performance gain is huge.

On my system with 8 cores, an SSD and a large 9 GB bag with 568 payload files, the results for `verify_bag()` are:

                     processing time          run time             throughput
    Version       user time    system time    total time    total    MB/s
     v0.71        38.31s        1.60s         39.938s       100%     230
     v0.81        25.48s        1.68s         27.1s          67%     340
     v0.82        48.85s        3.89s          6.84s         17%    1346

## How fast is [Archive::BagIt::Fast](https://metacpan.org/pod/Archive%3A%3ABagIt%3A%3AFast)?

It depends. On my system with 8 cores, an SSD and a 38 MB bag with 48 payload files, the results for `verify_bag()` are:

                   Rate         Base         Fast
    Base         3.01/s           --         -21%
    Fast         3.80/s          26%           --

On my system with 8 cores, an SSD and a large 9 GB bag with 568 payload files, the results for `verify_bag()` are:

                 s/iter         Base         Fast
    Base           74.6           --          -9%
    Fast           68.3           9%           --

You should measure which variant works best for you. In general, the default [Archive::BagIt](https://metacpan.org/pod/Archive%3A%3ABagIt) is fast enough.

## How to update an old bag of version v0.97 to v1.0?

You could try this:

    use Archive::BagIt;
    my $bag = Archive::BagIt->new( $my_old_bag_filepath );
    $bag->load();
    $bag->store();

## How to create UTF-8 based paths under MS Windows?

- For versions before Windows 10: I have no idea, and suggestions for a portable solution are very welcome!
- For Windows 10: Thanks to [https://superuser.com/questions/1033088/is-it-possible-to-set-locale-of-a-windows-application-to-utf-8/1451686#1451686](https://superuser.com/questions/1033088/is-it-possible-to-set-locale-of-a-windows-application-to-utf-8/1451686#1451686),
  you have to enable UTF-8 support via 'System Administration' -> 'Region' -> 'Administrative' -> 'Region Settings' -> flag 'Use Unicode UTF-8 for worldwide language support'.

Hint: The better way is to use only portable filenames. See [perlport](https://metacpan.org/pod/perlport) for details.
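One simple way to approximate "portable" is to restrict every path component to the POSIX portable filename characters (`A-Z`, `a-z`, `0-9`, `.`, `_`, `-`). The sketch below assumes that rule only; `is_portable_path` is a hypothetical helper, not part of Archive::BagIt, and it does not cover the other advice in perlport (for example case-insensitive file systems or name length limits).

```perl
use strict;
use warnings;

# Hypothetical helper: true if every component of a "/"-separated path uses
# only the POSIX portable filename characters and does not start with "-".
sub is_portable_path {
    my ($path) = @_;
    for my $part ( split m{/}, $path ) {
        return 0 unless $part =~ /\A[A-Za-z0-9._][A-Za-z0-9._-]*\z/;
    }
    return 1;
}

for my $path ( 'data/hello.txt', 'data/my file.txt', 'data/-leading-dash.txt' ) {
    printf "%-24s %s\n", $path, is_portable_path($path) ? 'portable' : 'not portable';
}
```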

# BUGS

There are problems related to Parallel::parallel\_map and IO::AIO under MS Windows. The tests are skipped there. Use the
parallel feature or [Archive::BagIt::Fast](https://metacpan.org/pod/Archive%3A%3ABagIt%3A%3AFast) at your own risk on an MS Windows system.
If you are an MS Windows developer, feel free to send me patches or hints to fix the issues.

# THANKS

Thanks to Rob Schmidt <rjeschmi@gmail.com> for trustingly handing over the project, and thanks for your initial work!
I would also like to thank Patrick Hochstenbach and Russell McOrmond for their valuable and especially detailed advice!
And without the helpful, sometimes rude help of the IRC channel #perl I would have been stuck in a lot of problems.
Without the support of my colleagues at SLUB Dresden, the project would never have made it this far.

# SYNOPSIS

This module helps with the basic commands needed to create
and verify a bag. This part supports BagIt 1.0 according to RFC 8493 ([https://tools.ietf.org/html/rfc8493](https://tools.ietf.org/html/rfc8493)).

To get started, you only need to know the following methods:

## read a BagIt

    use Archive::BagIt;

    #read in an existing bag:
    my $bag_dir = "/path/to/bag";
    my $bag = Archive::BagIt->new($bag_dir);

## construct a BagIt around a payload

    use Archive::BagIt;
    my $bag2 = Archive::BagIt->make_bag($bag_dir);

## verify a BagIt-dir

    use Archive::BagIt;

    # Validate a BagIt archive against its manifest
    my $bag3 = Archive::BagIt->new($bag_dir);
    my $is_valid1 = $bag3->verify_bag();

    # Validate a BagIt archive against its manifest, report all errors
    my $bag4 = Archive::BagIt->new($bag_dir);
    my $is_valid2 = $bag4->verify_bag( {report_all_errors => 1} );

## read a BagIt-dir, change something, store

Because all methods operate lazily, make sure to parse the relevant parts of the bag \*BEFORE\* you modify it.
Otherwise the existing content will be overwritten!

    use Archive::BagIt;
    my $bag5 = Archive::BagIt->new($bag_dir); # lazy, nothing has been read yet
    $bag5->load(); # this updates the object representation by parsing the given $bag_dir
    $bag5->store(); # this writes the bag anew

# METHODS

## Constructor

The constructor creates a bag object from a single argument,

    use Archive::BagIt;

    #read in an existing bag:
    my $bag_dir = "/path/to/bag";
    my $bag = Archive::BagIt->new($bag_dir);

or from named arguments (key/value pairs):

    use Archive::BagIt;

    #read in an existing bag:
    my $bag_dir = "/path/to/bag";
    my $bag = Archive::BagIt->new(
        bag_path => $bag_dir,
    );

The arguments are:

- `bag_path` - path to the bag directory
- `force_utf8` - if set, the warnings about non-portable filenames are disabled (default: warnings enabled)
- `use_async` - if set, IO::Async is used to read payload files asynchronously, only useful under Linux.
- `use_parallel` - if set, Parallel::parallel\_map is used to calculate digests of payload files in parallel,
      only useful if the underlying filesystem supports parallel reads and multiple CPU cores are available.

The bag object will use $bag\_dir, BUT an existing $bag\_dir is not read. If you use `store()` an existing bag will be overwritten!

See `load()` if you want to parse/modify an existing bag.

## use\_parallel()

If set, parallel digest processing is used. Default: false

## use\_async()

If set, asynchronous I/O is used. Default: false

## has\_force\_utf8()

Checks whether `force_utf8` was set.

If set, warnings about potential file path problems are ignored.

## bag\_path(\[$new\_value\])

Getter/setter for bag path

## metadata\_path()

Getter for metadata path

## payload\_path()

Getter for payload path

## checksum\_algos()

Getter for registered Checksums

## bag\_version()

Getter for bag version

## bag\_encoding()

Getter for bag encoding.

HINT: the current version of Archive::BagIt only supports UTF-8, but the method could return other values depending on given Bags.

## bag\_info(\[$new\_value\])

Getter/Setter for bag info. Expects/returns an array of HashRefs implementing simple key-value pairs.

HINT: RFC8493 does not allow \*reordering\* of entries!
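The sketch below illustrates why an ordered list of small hashrefs fits this requirement: order is preserved and a key may occur more than once. It assumes one single-pair hashref per bag-info entry. The sample values and the helper `values_for_key` are hypothetical and only imitate a lookup by key; they are not the module's code.

```perl
use strict;
use warnings;

# Assumed shape: one single-pair hashref per bag-info.txt entry, in file order.
my @bag_info = (
    { 'Source-Organization' => 'Example Library' },
    { 'External-Identifier' => 'urn:example:1' },
    { 'External-Identifier' => 'urn:example:2' },
);

# Hypothetical stand-in for a lookup by key; keeps the original order.
sub values_for_key {
    my ( $info_ref, $searchkey ) = @_;
    return map { $_->{$searchkey} }
        grep { exists $_->{$searchkey} } @{$info_ref};
}

my @identifiers = values_for_key( \@bag_info, 'External-Identifier' );
print join( ', ', @identifiers ), "\n";
```

A plain hash would lose both the entry order and the repeated key, which is why the list form is used here.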

## has\_bag\_info()

returns true if bag info exists.

## errors()

Getter to return the errors collected after a `verify_bag()` call with the option `report_all_errors`

## warnings()

Getter to return collected warnings after a `verify_bag()` call

## digest\_callback()

Derived classes can override this method to handle fixity checks in their own way. The
getter returns an anonymous function with the following interface:

    my $digest = $self->digest_callback;
    &$digest( $digestobject, $filename);

This anonymous function MUST use the `get_hash_string()` function of the [Archive::BagIt::Role::Algorithm](https://metacpan.org/pod/Archive%3A%3ABagIt%3A%3ARole%3A%3AAlgorithm) role,
which is implemented by each [Archive::BagIt::Plugin::Algorithm::XXXX](https://metacpan.org/pod/Archive%3A%3ABagIt%3A%3APlugin%3A%3AAlgorithm%3A%3AXXXX) module.

See [Archive::BagIt::Fast](https://metacpan.org/pod/Archive%3A%3ABagIt%3A%3AFast) for details.

## get\_baginfo\_values\_by\_key($searchkey)

Returns all values which match $searchkey, undef otherwise

## is\_baginfo\_key\_reserved\_as\_uniq($searchkey)

returns true if the key is reserved and should be unique

## is\_baginfo\_key\_reserved( $searchkey )

returns true if key is reserved

## verify\_baginfo()

Checks the baginfo keys. Returns true if everything is fine, otherwise returns undef and pushes the message to `errors()`.
Warnings are pushed to `warnings()`.

## delete\_baginfo\_by\_key( $searchkey )

Deletes the entry for the given $searchkey if it exists.
If multiple entries with $searchkey exist, only the last one is deleted.

## exists\_baginfo\_key( $searchkey )

returns true if a given $searchkey exists

## append\_baginfo\_by\_key($searchkey, $newvalue)

Appends a key value pair to bag\_info.

HINT: check the return code to see whether the append was successful, because some keys need to be unique.

## add\_or\_replace\_baginfo\_by\_key($searchkey, $newvalue)

It replaces the first entry with $newvalue if $searchkey exists, otherwise it appends.

## forced\_fixity\_algorithm()

Getter to return the forced fixity algorithm depending on BagIt version

## manifest\_files()

Getter to find all manifest-files

## tagmanifest\_files()

Getter to find all tagmanifest-files

## payload\_files()

Getter to find all payload-files

## non\_payload\_files()

Getter to find all non payload-files

## plugins()

Getter/setter for algorithm plugins

## manifests()

Getter/Setter for all manifests (objects)

## algos()

Getter/Setter for all registered algorithms

## load\_plugins

By default, SHA512 and MD5 are loaded and therefore used. If you want to create a bag with only one or a specific
checksum algorithm, you can use this method to (re-)register it. It expects a list of strings with namespaces of the form
Archive::BagIt::Plugin::Algorithm::XXX, where XXX is your chosen fixity algorithm.

## load()

Triggers loading of an existing bag

## verify\_bag($opts)

A method to verify a bag deeply. If `$opts` is set with `{return_all_errors}`, all fixity errors are reported.
The default is to croak with an error message if any error is detected.

HINT: You might also want to check Archive::BagIt::Fast to see a more direct way of accessing files (and thus faster).
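At its core, a fixity check compares the digest recorded in a manifest with a freshly calculated digest of the file. The following sketch shows that comparison for a single file with the core module Digest::SHA. It is an illustration under stated assumptions, not the module's implementation: `fixity_ok` is a hypothetical helper, and the temporary file stands in for a payload file.

```perl
use strict;
use warnings;
use Digest::SHA;
use File::Temp qw(tempfile);

# Hypothetical helper: true if the SHA-512 hex digest of $file equals $expected.
sub fixity_ok {
    my ( $file, $expected ) = @_;
    my $sha = Digest::SHA->new(512);
    $sha->addfile( $file, 'b' );    # 'b' reads the file in binary mode
    return lc($expected) eq $sha->hexdigest;
}

# Stand-in for a payload file and its manifest entry.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
print {$fh} "hello\n";
close $fh or die "cannot close $file: $!";
my $expected = Digest::SHA::sha512_hex("hello\n");

print fixity_ok( $file, $expected ) ? "match\n" : "mismatch\n";
```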

## calc\_payload\_oxum()

returns an array with the octet count and stream count of the payload directory
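For orientation, the Payload-Oxum in `bag-info.txt` has the form `<octets>.<streamcount>`, that is, the total size in bytes and the number of payload files. A minimal sketch of that calculation with core modules follows. `payload_oxum` is a hypothetical helper and the temporary directory stands in for a payload directory; the module's own implementation may differ.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper: returns (octets, streamcount) for all regular files
# below $payload_dir.
sub payload_oxum {
    my ($payload_dir) = @_;
    my ( $octets, $streams ) = ( 0, 0 );
    find(
        {
            no_chdir => 1,
            wanted   => sub {
                return unless -f $File::Find::name;
                $octets += -s _;    # reuse the stat data from the -f test
                $streams++;
            },
        },
        $payload_dir
    );
    return ( $octets, $streams );
}

# Stand-in payload directory with two files of five bytes each.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw(a.txt b.txt)) {
    open my $fh, '>', "$dir/$name" or die "cannot write $name: $!";
    print {$fh} 'hello';
    close $fh or die "cannot close $name: $!";
}

my ( $octets, $streams ) = payload_oxum($dir);
print "Payload-Oxum: $octets.$streams\n";
```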

## calc\_bagsize()

returns a string with the human-readable size of the payload

## create\_bagit()

creates a bagit.txt file

## create\_baginfo()

creates a bag-info.txt file

Hint: the entries 'Bagging-Date', 'Bag-Software-Agent', 'Payload-Oxum' and 'Bag-Size' will be automagically set,
existing values in internal bag-info representation will be overwritten!

## store()

Stores a bagit object if the bagit directory structure was already constructed.

## init\_metadata()

A constructor that will just create the metadata directory

This won't make a bag, but it will create the conditions to do that eventually

## make\_bag( $bag\_path )

A constructor that makes and returns a bag from a directory.

It expects that a preliminary bagit directory exists.
If a data directory already exists, it is assumed to be a bag already (no checking for invalid files in root).

# AVAILABILITY

The latest version of this module is available from the Comprehensive Perl
Archive Network (CPAN). Visit [http://www.perl.com/CPAN/](http://www.perl.com/CPAN/) to find a CPAN
site near you, or see [https://metacpan.org/module/Archive::BagIt/](https://metacpan.org/module/Archive::BagIt/).

# BUGS AND LIMITATIONS

You can make new bug reports, and view existing ones, through the
web interface at [http://rt.cpan.org](http://rt.cpan.org).

# AUTHOR

Andreas Romeyke <cpan@andreas.romeyke.de>

# COPYRIGHT AND LICENSE

This software is copyright (c) 2021 by Rob Schmidt <rjeschmi@gmail.com>, William Wueppelmann and Andreas Romeyke.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.