Introduction

Information on 3D structures of proteins originally came formatted in PDB files. Although the specification for this format had some real restrictions like a mandatory HEADER and CRYST line, many programs implemented this very poorly often writing out only ATOM records. And users became used to this.

The legacy PDB format has some severe limitations rendering it useless for all but very small protein structures. A new format called mmCIF has been around for decades and now is the default format for the Protein Data Bank.

The software developed in the PDB-REDO project aims at improving 3D models based on original experimental data. For this, the tools need to be able to work with both legacy PDB and mmCIF files. A decision was made to make mmCIF leading internally in all programs and convert legacy PDB directly into mmCIF before processing the data. A robust conversion had to be developed to make this possible since, as noted above, files can come with more or less information making it sometimes needed to do a sequence alignment to find out the exact residue numbers.

And so libcif++ came to life, a library to work with mmCIF files. Work on this library started early 2017 and has developed quite a bit since then. To reduce dependency on other libraries, some functionality was added that is not strictly related to reading and writing mmCIF files but may be useful nonetheless. This is mostly code that is used in 3D calculations and symmetry operations.

Design

The main part of the library is a set of classes that work with mmCIF files. They are:

The cif::file class encapsulates the contents of a mmCIF file. In such a file there are one or more cif::datablock objects and each datablock contains one or more cif::category objects.

Synopsis

Using libcifpp is easy, if you are familiar with modern C++:

// A simple program counting residues with an OXT atom

#include <filesystem>
#include <iostream>

#include <cif++.hpp>

namespace fs = std::filesystem;

int main(int argc, char *argv[])
{
    if (argc != 2)
        exit(1);

    // Read file, can be PDB or mmCIF and can even be compressed with gzip.
    cif::file file = cif::pdb::read(argv[1]);

    if (file.empty())
    {
        std::cerr << "Empty file" << std::endl;
        exit(1);
    }

    // Take the first datablock in the file
    auto &db = file.front();

    // Use the atom_site category
    auto &atom_site = db["atom_site"];

    // Count the atoms with atom-id "OXT"
    auto n = atom_site.count(cif::key("label_atom_id") == "OXT");

    std::cout << "File contains " << atom_site.size() << " atoms of which "
              << n << (n == 1 ? " is" : " are") << " OXT" << std::endl
              << "residues with an OXT are:" << std::endl;

    // Loop over all atoms with atom-id "OXT" and print out some info.
    // That info is extracted using structured binding in C++
    for (const auto &[asym, comp, seqnr] :
            atom_site.find<std::string, std::string, int>(
                cif::key("label_atom_id") == "OXT",
                "label_asym_id", "label_comp_id", "label_seq_id"))
    {
        std::cout << asym << ' ' << comp << ' ' << seqnr << std::endl;
    }

    return 0;
}