Quickstart

Installation

To install the package from the source repository, execute the following command:

pip install git+https://gitlab.ifisc.uib-csic.es/socio-physics/querier.git

You can then import the module to check it was properly installed:

import querier as qr

Credentials file

Accessing any database requires a credentials file in addition to the database name. The database administrator should provide your username and password, along with any other required parameters.

A credentials file is a configuration file (CFG format) that Querier uses to access the databases. It contains two types of sections: sources, which define where the data is retrieved from, and specific databases.

An example of a credentials file is shown below:

[mongodb]
host=mongo0.ifisc.lan,mongo1.ifisc.lan
port=27017

[twitter]
suffixes=_2014,_2015,_2016
type=mongodb
ruser=<twitter_username>
rpwd=<twitter_password>

In this example, the source is a MongoDB server (section [mongodb]) and the specific database is [twitter]. The ‘type’ field marks the [twitter] section as a database hosted on the [mongodb] source.

For example, we could append another specific database for the same source:

[flightradar]
type=mongodb
ruser=<flightradar_username>
rpwd=<flightradar_password>
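To make the two kinds of sections concrete, the sketch below reads the example file with Python's standard configparser module. It only illustrates the file layout and is not how Querier itself reads credentials. The rule used here to tell sources from databases (a section with a ‘type’ field is a database) is an assumption drawn from the examples above.

```python
import configparser

# Example credentials text, copied from the sections shown above.
CREDENTIALS = """
[mongodb]
host=mongo0.ifisc.lan,mongo1.ifisc.lan
port=27017

[twitter]
suffixes=_2014,_2015,_2016
type=mongodb
ruser=<twitter_username>
rpwd=<twitter_password>

[flightradar]
type=mongodb
ruser=<flightradar_username>
rpwd=<flightradar_password>
"""

# Disable interpolation so that '%' characters in passwords are read literally.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CREDENTIALS)

# Assumption for this sketch: a section with a 'type' field is a database,
# and any other section is a source.
databases = {
    name: parser[name]["type"]
    for name in parser.sections()
    if "type" in parser[name]
}
sources = [name for name in parser.sections() if "type" not in parser[name]]

print(sources)
print(databases)
```

Each database section points back to its source through the ‘type’ field, so several databases can share one server definition.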

Note

The default location of the credentials file is your home directory (~/.credentials.cfg). A file whose name starts with a dot (‘.’) is hidden in file explorers (the ‘-a’ option of ls lists hidden files in a directory).
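Since the file is hidden, it can be convenient to check for it from Python. The snippet below is a small sketch that uses only the standard library and the default path stated above.

```python
import os

# Default location stated above; '~' expands to the current user's home directory.
default_path = os.path.expanduser("~/.credentials.cfg")

# Report whether the file is present before trying to open a connection.
status = "exists" if os.path.isfile(default_path) else "is missing"
print(default_path, status)
```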

Note

Database sections can have an optional field called ‘suffixes’ (see [twitter] in the example above). It defines several similar databases. In the previous example, the section [twitter] grants the user access to the following databases:

  • twitter_2014

  • twitter_2015

  • twitter_2016

These databases are available when using any extraction method from querier.Connection.

For more details, see querier.Connection.
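The naming rule suggested by the list above (section name followed by each suffix) can be sketched in a few lines. The helper expand_suffixes is hypothetical and written only for illustration; it is not part of querier.

```python
def expand_suffixes(section, suffixes):
    """Return the database names for a section name and its 'suffixes' value."""
    # The field is a comma-separated list; ignore stray spaces and empty items.
    parts = [s.strip() for s in suffixes.split(",") if s.strip()]
    return [section + suffix for suffix in parts]


names = expand_suffixes("twitter", "_2014,_2015,_2016")
print(names)  # ['twitter_2014', 'twitter_2015', 'twitter_2016']
```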

Connect to a MongoDB database

A querier.Connection object is required to retrieve data from a database. To create it, a credentials file and a database name are required. The list of databases you are allowed to access will be provided by the database administrator.

There are two ways to start a new connection:

  • querier.Connection supports Python’s ‘with’ statement. This form is preferred because it closes the connection automatically:

    import querier as qr
    with qr.Connection('twitter_2020') as con:
        ...  # Use con
    
  • It can be instantiated and then closed manually using querier.Connection.close():

    import querier as qr
    con = qr.Connection('twitter_2020')
    # Use con
    con.close()
    

Both examples create an object called con of type querier.Connection, use it to extract data and then close it.
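The reason to prefer the ‘with’ form is that the connection is closed even when the code using it raises an exception. The sketch below shows this with a stand-in class, FakeConnection, which is hypothetical and only mimics the open/close behaviour; it assumes that querier.Connection follows the usual context manager protocol, as its ‘with’ support suggests.

```python
class FakeConnection:
    """Hypothetical stand-in for illustration; not the querier implementation."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs on normal exit and also when the block raises.
        self.close()
        return False  # do not swallow the exception


con = FakeConnection("twitter_2020")
try:
    with con:
        raise ValueError("something failed while using con")
except ValueError:
    pass

print(con.closed)  # True
```

With the manual form, the same guarantee requires wrapping the code that uses the connection in try/finally and calling close() in the finally clause.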

The constructor starts a process to connect to the database. This process may complete immediately or take up to 30 seconds. If the connection succeeds, the Connection object can be used to extract data from the database. Otherwise, an appropriate exception is raised (see Custom errors).

Extract from a collection

Each database may contain several collections. To extract data from a specific collection, you can select it with square brackets:

with qr.Connection('twitter_2020') as con:
    result = con['collection_name'].extract(...)

This calls the querier.CollectionsAccessor.extract() method, which is equivalent to passing collections_subset='collection_name' to the querier.Connection.extract() method.

To find out which collections are available in the database, use the querier.Connection.list_available_collections() method:

with qr.Connection('twitter_2020') as con:
    print(con.list_available_collections())

Database format

Entries in a MongoDB database are stored in a format similar to Python dictionaries. Each entry is a collection of fields, each with an associated value (which can be a simple type, a compound type, or even another dictionary). Here’s an example of an entry from the twitter database:

{
    'created_at': datetime.datetime(2020, 1, 4, 13, 49, 59),
    'favorite_count': 0,
    'favorited': False,
    'lang': 'es',
    'place': {'attributes': {},
       'bounding_box': {'coordinates': [[[-109.479171, -56.557358],
                                         [-109.479171, -17.497384],
                                         [-66.15203, -17.497384],
                                         [-66.15203, -56.557358]]],
                        'type': 'Polygon'},
       'country': 'Chile',
       'country_code': 'CL',
       'full_name': 'Chile',
       'id': '47a3cf27863714de',
       'name': 'Chile',
       'place_type': 'country',
       'url': 'https://api.twitter.com/1.1/geo/id/47a3cf27863714de.json'},

    . . .
}

Entries are returned by querier as Python dictionaries. You can access a field by its name:

>>> tweet['created_at']
datetime.datetime(2020, 1, 4, 13, 49, 59)

>>> tweet['place']['bounding_box']
{
    'coordinates': [[[-109.479171, -56.557358],
                    [-109.479171, -17.497384],
                    [-66.15203, -17.497384],
                    [-66.15203, -56.557358]]],
    'type': 'Polygon'
}

The different operations to extract entries from the database are documented and explained in querier.Connection.
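Because entries are ordinary dictionaries, nested values can be processed with plain Python. As an illustration, the sketch below computes the extent of the bounding box from the example entry. The helper bounding_box_extent is hypothetical, and two things are assumptions: that each coordinate pair is ordered as [longitude, latitude], and that some entries may lack a ‘place’ field.

```python
# Sample entry reduced to the 'place' field from the example above.
tweet = {
    "place": {
        "bounding_box": {
            "coordinates": [[[-109.479171, -56.557358],
                             [-109.479171, -17.497384],
                             [-66.15203, -17.497384],
                             [-66.15203, -56.557358]]],
            "type": "Polygon",
        },
        "country_code": "CL",
    }
}


def bounding_box_extent(entry):
    """Return (min_lon, min_lat, max_lon, max_lat), or None without a box."""
    place = entry.get("place") or {}
    box = place.get("bounding_box")
    if not box:
        return None
    ring = box["coordinates"][0]         # outer ring of the polygon
    lons = [point[0] for point in ring]  # assumed order: [longitude, latitude]
    lats = [point[1] for point in ring]
    return min(lons), min(lats), max(lons), max(lats)


print(bounding_box_extent(tweet))
print(bounding_box_extent({"lang": "es"}))
```

Using .get() for the outer fields avoids a KeyError on entries that carry no location data.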

Creating a filter

A querier.Filter is required to retrieve data from a database. Filters select the entries that satisfy specific conditions.

The simplest filter is the empty filter:

import querier as qr
f = qr.Filter()

Because it defines no conditions, it makes the querier.Connection.extract() method return all entries in the database.

Filter methods can be used (see querier.Filter) to add simple conditions that test a particular field from the database.

Example of a filter:

import querier as qr
f = qr.Filter()
f.greater_than('retweet_count', 500)
f.less_than('retweet_count', 1000)
f.any_of('place.country_code', ['ES', 'FR'])

This filter will only allow tweets (entries) from Spain or France with a number of retweets between 500 and 1000.
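To make the combined conditions explicit, the sketch below expresses the same selection as a plain Python predicate over entry dictionaries. It is only a reading aid and not how querier evaluates filters. It assumes that all conditions must hold together and that greater_than and less_than are strict comparisons, which the text above does not state.

```python
def matches(entry):
    """Plain-Python reading of the example filter (strict comparisons assumed)."""
    retweets = entry.get("retweet_count")
    country = (entry.get("place") or {}).get("country_code")
    return (
        retweets is not None
        and 500 < retweets < 1000
        and country in ("ES", "FR")
    )


sample = [
    {"retweet_count": 750, "place": {"country_code": "ES"}},
    {"retweet_count": 750, "place": {"country_code": "CL"}},
    {"retweet_count": 1200, "place": {"country_code": "FR"}},
    {"retweet_count": 600},
]
print([matches(e) for e in sample])  # [True, False, False, False]
```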

Note

Nested fields can be identified with dot notation (‘.’). The previous example adds a condition on the field ‘place.country_code’, which refers to country_code, a subfield of the field named place.
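The way a dotted name maps onto a nested entry can be sketched with a small helper. The function get_nested is hypothetical and written for illustration only; it is not part of querier.

```python
def get_nested(entry, dotted_name, default=None):
    """Follow a dotted field name through nested dictionaries."""
    current = entry
    for key in dotted_name.split("."):
        # Stop as soon as a level is missing or is not a dictionary.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


tweet = {"lang": "es", "place": {"country": "Chile", "country_code": "CL"}}
print(get_nested(tweet, "place.country_code"))  # CL
print(get_nested(tweet, "place.missing"))       # None
```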

See Examples for several code snippets that use querier to extract data. The full list of classes and methods is documented in the Querier API.