Basic usage

This library, libcifpp, is a generic CIF library with some specific additions to work with mmCIF files. The main focus of this library is to make sure that files read or written are valid. That is, they are syntactically valid and their content is valid with respect to a CIF dictionary, if such a dictionary is available and specified.

Reading a file is as simple as:

#include <cif++.hpp>

cif::file f("/path/to/file.cif");

The file may also be compressed using gzip which is detected automatically.

Writing out the file again is also simple, to write out the terminal you can do:

std::cout << f;

// or
f.save(std::cout);

// or write a compressed file using gzip compression:
f.save("/tmp/f.cif.gz");

CIF files contain one or more datablocks. To print out the names of all datablocks in our file:

for (auto &db : f)
    std::cout << db.name() << '\n';

Most often libcifpp is used to read in structure files in mmCIF format. These files only contain one datablock and so you can safely use code like this:

// get a reference to the first datablock in f
auto &db = f.front();

But if you know the name of the datablock, this also works:

// get a reference to the datablock name '1CBS'
auto &db = f["1CBS"];

Now, each datablock contains categories. To print out all their names:

for (auto &cat : db)
    std::cout << cat.name() << '\n';

But you probably know what category you need to use, so lets fetch it by name:

// get a reference to the atom_site category in db
auto &atom_site = db["atom_site"];

// and make sure there's some data in it:
assert(not atom_site.empty());

Note

Note that we omit the leading underscore in the name of the category here.

Categories contain rows of data and each row has fields or items. Referencing a row in a category results in a cif::row_handle object which you can use to request or manipulate item data.

// Get the first row in atom_site
auto rh = atom_site.front();

// Get the label_atom_id value from this row handle as a std::string
std::string atom_id = rh["label_atom_id"].as<std::string>();

// Get the x, y and z coordinates using structered binding
const auto &[x, y, z] = rh.get<float,float,float>("Cartn_x", "Cartn_y", "Cartn_z");

// Assign a new value to the x coordinate or our atom
rh["Cartn_x"] = x + 1;

Querying

Walking over the rows in a category is often not very useful. More often you are interested in specific rows in a category. The function cif::category::find() and friends are here to help.

What these functions have in common is that they return data based on a query implemented by cif::condition. These condition objects are built in code using regular C++ syntax. The most basic example of a query is:

cif::condition c = cif::key("id") == 1;

Here the condition is that all rows returned should have a value of 1 in there item named id. Likewise you can use other data types and even combine those. Oh, and I said we use regular C++ syntax for conditions, so you may as well use other operators to compare values:

// condition for C-alpha atoms having an occupancy less than 1.0
cif::condition c = cif::key("occupancy") < 1.0f and cif::key("label_atom_id") == "CA";

Using the namespace cif::literals that code becomes a little less verbose:

using namespace cif::literals;
cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";

Conditions can also be combined:

cif::condition c = "occupancy"_key < 1.0f and "label_atom_id"_key == "CA";

// extend the condition by requiring the compound ID to be unequal to PRO
c = std::move(c) and "label_comp_id"_key != "PRO";

Note

Note the use of std::move here.

Using queries constructed in this way is simple:

cif::condition c = ...
auto result = atom_site.find(std::move(c));

// or construct a condition inline:
auto result = atom_site.find("label_atom_id"_key == "CA");

In the example above the result is a range of cif::row_handle objects. Often, using individual field values is more useful:

// Requesting a single item:
for (auto id : atom_site.find<std::string>("label_atom_id"_key == "CA", "id"))
    std::cout << "ID for CA: " << id << '\n';

// Requesting multiple items:
for (const auto &[id, x, y, z] : atom_site.find<std::string,float,float,float>("label_atom_id"_key == "CA",
        "id", "Cartn_x", "Cartn_y", "Cartn_z"))
{
    std::cout << "Atom " << id << " is at [" << x << ", " << y << ", " z << "]\n";
}

Returning a complete set if often not required, if you only want to have the first you can use cif::category::find_first() as shown here:

// return the ID item for the first C-alpha atom
std::string v1 = atom_site.find_first<std::string>("label_atom_id"_key == "CA", "id");

// If you're not sure the row exists, use std::optional
auto v2 = atom_site.find_first<std::optional<std::string>>("label_atom_id"_key == "CA", "id");
if (v2.has_value())
    ...

There are cases when you really need exactly one result. The cif::category::find1() can be used in that case, it will throw an exception if the query does not result in exactly one row.

NULL and ANY

Sometimes items may be empty. The trouble is a bit that empty comes in two flavors: unknown and null. Null in CIF parlance means the item should not contain a value since it makes no sense in this case, the value stored in the file is a single dot character: '.'. E.g. atom_site records may have a NULL value for label_seq_id for atoms that are part of a non-polymer.

The other empty value is indicated by a question mark character: '?'. This means the value is simply unknown.

Both these are NULL in libcifpp conditions and can be searched for using cif::null.

So you can search for:

cif::condition c = "label_seq_id"_key == cif::null;

You might also want to look for a certain value and don’t care in which item it is stored, in that case you can use cif::any.

cif::condition c = cif::any == "foo";

And in linked record you might have the items that have a value in both parent and child or both should be NULL. For that, you can request the value to return by find to be of type std::optional and then use that value to build the query. An example to explain this, let’s find the location of the atom that is referenced as the first atom in a struct_conn record:

// Take references to the two categories we need
auto struct_conn = db["struct_conn"];
auto atom_site = db["atom_site"];

// Loop over all rows in struct_conn taking only the values we need
// Note that the label_seq_id is returned as a std::optional<int>
// That means it may contain an integer or may be empty
for (const auto &[asym1, seqid1, authseqid1, atomid1] :
    struct_conn.rows<std::string,std::optional<int>,std::string,std::string,std::string>(
        "ptnr1_label_asym_id", "ptnr1_label_seq_id", "ptnr1_auth_seq_id", "ptnr1_label_atom_id"
    ))
{
    // Find the location of the first atom
    cif::point p1 = atom_site.find1<float,float,float>(
        "label_asym_id"_key == asym1 and "label_seq_id"_key == seqid1 and "auth_seq_id"_key == authseqid1 and "label_atom_id"_key == atomid1,
        "cartn_x", "cartn_y", "cartn_z");
}

Validation

CIF files can have a dictionary attached. And based on such a dictionary a cif::validator object can be constructed which in turn can be used to validate the content of the file.

A simple case:

#include <cif++.hpp>

cif::file f("1cbs.cif.gz");
f.load_dictionary("mmcif_pdbx");

if (not f.is_valid())
    std::cout << "This file is not valid\n";

If you want to know why it is not valid, you should set the global variable cif::VERBOSE to something higer than zero. Depending on the value more or less diagnostic output is sent to std::cerr.

In the case above we load a dictionary based on its name. You can of course also load dictionaries based on a specific file, that’s a bit more work:

std::filesystem::ifstream dictFile("/tmp/my-dictionary.dic");
auto &validator = cif::parse_dictionary("my-dictionary", dictFile);

cif::file f("1cbs.cif.gz");

// assign the validator
f.set_validator(&validator);

// alternatively, load it by name
f.load_dictionary("my-dictionary");

if (not f.is_valid())
    std::cout << "This file is not valid\n";

Creating your own dictionary is a lot of work, especially if you are only extending an existing dictionary with a couple of new categories or items. So, what you can do is extend a loaded validator like this (code taken from DSSP):

// db is a cif::datablock reference containing an mmCIF file with DSSP annotations
auto &validator = const_cast<cif::validator &>(*db.get_validator());
if (validator.get_validator_for_category("dssp_struct_summary") == nullptr)
{
    auto dssp_extension = cif::load_resource("dssp-extension.dic");
    if (dssp_extension)
        cif::extend_dictionary(validator, *dssp_extension);
}

Note

In the example above we’re loading the data using Resources. See the documentation on that for more information.

If a validator has been assigned to a file, assignments to items are checked for valid data. So the following code will throw an exception (see: _atom_site-label):

auto rh = atom_site.front();
rh["Cartn_x"] = "foo";

Linking

Based on information recorded in dictionary files (see Validation) you can locate linked records in parent or child categories.

To make this example not too complex, lets assume the following example file:

data_test
loop_
_cat_1.id
_cat_1.name
_cat_1.desc
1 aap  Aap
2 noot Noot
3 mies Mies

loop_
_cat_2.id
_cat_2.name
_cat_2.num
_cat_2.desc
1 aap  1 'Een dier'
2 aap  2 'Een andere aap'
3 noot 1 'walnoot bijvoorbeeld'

And we have a dictionary containing the following link definition:

loop_
_pdbx_item_linked_group_list.parent_category_id
_pdbx_item_linked_group_list.link_group_id
_pdbx_item_linked_group_list.parent_name
_pdbx_item_linked_group_list.child_name
_pdbx_item_linked_group_list.child_category_id
cat_1 1 '_cat_1.name' '_cat_2.name' cat_2

So, there are links between cat_1 and cat_2 based on the value in items named name. Using this information, we can now locate children and parents:

// Assuming the file was loaded in f:
auto &cat1 = f.front()["cat_1"];
auto &cat2 = f.front()["cat_2"];
auto &cat3 = f.front()["cat_3"];

// Loop over all ape's in cat2
for (auto r : cat1.get_children(cat1.find1("name"_key == "aap"), cat2))
    std::cout << r.get<std::string>("desc") << '\n';

Updating a value in an item in a parent category will update the corresponding value in all related children:

auto r1 = cat1.find1("id"_key == 1);
r1["name"] = "aapje";

auto rs1 = cat2.find("name"_key == "aapje");
assert(rs1.size() == 2);

However, changing a value in a child record will not update the parent. This may result in an invalid file since you may then have a child that has no parent:

auto r2 = cat2.find1("id"_key == 3);
r2["name"] = "wim";

assert(f.is_valid() == false);

So you have to fix this yourself by inserting a new item in cat1 with the new value.

Another situation is when you change a value in a parent and updating children might introduce a situation where you need to split a child. To give an example, consider this:

data_test
loop_
_cat_1.id
_cat_1.name
_cat_1.desc
1 aap  Aap
2 noot Noot
3 mies Mies

loop_
_cat_2.id
_cat_2.name
_cat_2.num
_cat_2.desc
1 aap  1 'Een dier'
2 aap  2 'Een andere aap'
3 noot 1 'walnoot bijvoorbeeld'

loop_
_cat_3.id
_cat_3.name
_cat_3.num
1 aap 1
2 aap 2

And we have a dictionary containing the following link definition (reversed compared to the previous example):

loop_
_pdbx_item_linked_group_list.parent_category_id
_pdbx_item_linked_group_list.link_group_id
_pdbx_item_linked_group_list.parent_name
_pdbx_item_linked_group_list.child_name
_pdbx_item_linked_group_list.child_category_id
cat_2 1 '_cat_2.name' '_cat_1.name' cat_1
cat_3 1 '_cat_3.name' '_cat_2.name' cat_2
cat_3 1 '_cat_3.num'  '_cat_2.num'  cat_2

So cat3 is a parent of cat2 and cat2 is a parent of cat1. Now, if you change the name value of the first row of cat3 to ‘aapje’, the corresponding row in cat2 is updated as well. But when you update cat2 you have to update cat1 too. And simply changing the name field in row 1 of cat1 is wrong. The default behaviour in libcifpp is to split the record in cat1 and have a new child with the new name whereas the other remains as is.

The new cat1 will thus be like:

loop_
_cat_1.id
_cat_1.name
_cat_1.desc
1 aapje Aap
2 noot  Noot
3 mies  Mies
5 aap   Aap