Connection class

class querier.Connection(dbnamecfg, credentials_path='~/.credentials.cfg')[source]

Establishes a connection with a database and allows data retrieval.

The first parameter (dbnamecfg) must be a valid database name. The database administrator should provide both the credentials and the database names you are allowed to access.

Parameters:

dbnamecfg (str) – The name of the database
credentials_path (str, default "~/.credentials.cfg") – Path to a configuration file (.cfg) with credentials for the database.

Examples

The following snippet shows the simplest way to create a Connection:

import querier as qr

with qr.Connection(dbname) as con:
    ...  # Use con here

where dbname is the name of a database.

aggregate(pipeline, collections_subset=None, **aggregate_kwargs)[source]

Extract entries from the database resulting from a processing pipeline.

To limit the number of entries that will be returned, use Result.limit(). As databases can contain a huge number of entries, it is advised to test the code with a limited result first.

To iterate through the entries, see querier.Result.

Parameters:

pipeline (list[dict]) – List of aggregation pipeline stages.
collections_subset (list | None) – List of collections to extract from. A subset of Connection.list_available_collections() or None (all collections).
**aggregate_kwargs – additional keyword arguments to pass on to pymongo.collection.Collection.aggregate()

Return type:

Result

Returns:

Result for iterating over the matching entries in the database, or None if no entry matches.
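As an illustration of the pipeline argument, the sketch below builds a list of stages as plain dictionaries. The operators $match, $group and $sort are standard MongoDB aggregation stages; the field names favorite_count and place.country are borrowed from the other examples on this page, and the helper name build_pipeline is hypothetical.

```python
def build_pipeline(min_favorites):
    """Return aggregation stages counting popular tweets per country."""
    return [
        # Keep only entries with more than `min_favorites` favorites.
        {"$match": {"favorite_count": {"$gt": min_favorites}}},
        # Count the remaining entries for each country.
        {"$group": {"_id": "$place.country", "nr_tweets": {"$sum": 1}}},
        # Most frequent countries first.
        {"$sort": {"nr_tweets": -1}},
    ]


pipeline = build_pipeline(500)
# With an open connection `con`, the list would be passed as:
#     result = con.aggregate(pipeline)
```

Building the stages in a function keeps the pipeline easy to inspect before it is sent to the database.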

close()[source]

Close the connection.

count_entries(filter=None, collections_subset=None)[source]

Count the entries that match a filter.

Parameters:

filter (Filter | None) – Filter object to count entries or None (all entries).
collections_subset (list | None) – List of collections to extract from. A subset of Connection.list_available_collections() or None (all collections).

Return type:

int

Returns:

The number of entries matching the filter in the selected collections.

Examples

Count how many tweets from Spain in 2014 have more than 500 favorites:

>>> import querier as qr
>>> f = qr.Filter()
>>> f.greater_than('favorite_count', 500)
>>> with qr.Connection('twitter_2014') as con:
...     con.count_entries(f, collections_subset=['spain'])
3

distinct(field_name, filter=None, collections_subset=None)[source]

Return a set with all the possible values that the field can take.

Parameters:

field_name (str) – The name of the field to test.
filter (Filter | None) – Filter to test the entries.
collections_subset (list | None) – List of collections to extract from. A subset of Connection.list_available_collections() or None (all collections).

Return type:

set

Returns:

Set of distinct values.

Examples

>>> import querier as qr
>>> with qr.Connection('twitter_2020') as con:
...     con.distinct('place.country')
{'Spain', 'France', 'Portugal', 'Germany', ...}
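Because distinct() is documented to return a set, results from different collection subsets can be compared with ordinary set operations. The following is a minimal sketch; the helper name values_only_in is hypothetical and con stands for an open Connection.

```python
def values_only_in(con, field_name, first, second):
    """Return values of `field_name` seen in `first` but not in `second`.

    `first` and `second` are lists of collection names.
    """
    in_first = con.distinct(field_name, collections_subset=first)
    in_second = con.distinct(field_name, collections_subset=second)
    # distinct() is documented to return a set, so set difference applies.
    return in_first - in_second


# Possible usage with an open connection `con`:
#     values_only_in(con, 'place.country', ['spain'], ['northern_america'])
```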

extract(filter=None, fields=None, collections_subset=None)[source]

Extract entries from the database that match a filter.

To limit the number of entries that will be returned, use Result.limit(). As databases can contain a huge number of entries, it is advised to test the code with a limited result first.

To iterate through the entries, see querier.Result.

Parameters:

filter (Filter | None) – Filter to test the entries.
fields (list | None) – List of selected fields from the original database that the result dictionaries will have. This is useful when only a subset of the fields is needed.
collections_subset (list | None) – List of collections to extract from. A subset of Connection.list_available_collections() or None (all collections).

Return type:

Result

Returns:

Result for iterating over the matching entries in the database, or None if no entry matches the filter.
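The advice to test with a limited result first could be wrapped in a small helper such as the one sketched below. It assumes that Result.limit(n) returns an iterable over at most n entries, which this page does not spell out; the helper name preview is hypothetical.

```python
def preview(con, entry_filter=None, fields=None, n=10):
    """Return at most `n` matching entries as a list of dicts."""
    result = con.extract(entry_filter, fields=fields)
    if result is None:
        # Documented return value when no entry matches the filter.
        return []
    # Assumption: Result.limit(n) returns an iterable of at most n entries.
    return list(result.limit(n))


# Possible usage with an open connection `con` and a Filter `f`:
#     for entry in preview(con, f, fields=['favorite_count'], n=5):
#         print(entry)
```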

extract_one(filter=None, collections_subset=None)[source]

Extract an entry from the database that matches a filter.

Parameters:

filter (Filter | None) – Filter used to test the entries or None (all entries).
collections_subset (list | None) – List of collections to extract from. A subset of Connection.list_available_collections() or None (all collections).

Return type:

dict | None

Returns:

Entry from the database or None if no entry matches the filter.
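Since extract_one() returns a dict or None, it is a convenient way to see which top-level fields an entry has before choosing the fields argument of extract(). A minimal sketch, with the hypothetical helper name sample_field_names:

```python
def sample_field_names(con, collections_subset=None):
    """Return the sorted top-level field names of one entry, or []."""
    entry = con.extract_one(collections_subset=collections_subset)
    if entry is None:
        # No entry available in the selected collections.
        return []
    # Iterating a dict yields its keys; nested fields are not expanded.
    return sorted(entry)


# Possible usage with an open connection `con`:
#     print(sample_field_names(con, collections_subset=['spain']))
```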

groupby(field_name, collection, pre_filter=None, post_filter=None, **aggregate_kwargs)[source]

Group by a given field.

Initialize an aggregation pipeline on the collection named by collection, in which entries are filtered according to pre_filter, grouped by the field(s) field_name, and the resulting groups are then filtered according to post_filter. The aggregations performed in the group-by stage are specified by a subsequent call to MongoGroupBy.agg().

Parameters:

field_name – Name or list of names of the field(s) by which to group. Note that nested fields using the dot notation will appear in the output with dots (‘.’) replaced by underscores (‘_’).
collection – Name of the collection to perform the aggregation from.
pre_filter – Filter to apply before the aggregation.
post_filter – Filter to apply after the aggregation.
silence_warning – If True, silence the warning about doing an aggregation on multiple collections.
**aggregate_kwargs – additional keyword arguments to pass on to pymongo.collection.Collection.aggregate().

Returns:

A MongoGroupBy instance enabling aggregation by field_name.
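The renaming of nested fields described for field_name can be mirrored in plain Python, which helps when looking up the grouped field in the output. This sketch only reproduces the documented rule (dots replaced by underscores); the helper name output_keys and the field lang are hypothetical.

```python
def output_keys(field_names):
    """Apply the documented dot-to-underscore renaming to field names."""
    if isinstance(field_names, str):
        field_names = [field_names]
    return [name.replace(".", "_") for name in field_names]


print(output_keys("place.id"))                 # ['place_id']
print(output_keys(["place.country", "lang"]))  # ['place_country', 'lang']
```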

list_available_collections()[source]

Return a list of available collections.

MongoDB databases can be split into collections. For example, all tweets from the USA are in the collection ‘northern_america’ in twitter databases. Contact the database administrator to find out whether and how collections are semantically split.

Extraction methods can be sped up by using a subset of collections.

Return type:

list

Returns:

A list of all the available collection names.
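Before passing a collections_subset to an extraction method, the requested names can be checked against list_available_collections(). The following is a minimal sketch under that idea; the helper name checked_subset is hypothetical and con stands for an open Connection.

```python
def checked_subset(con, wanted):
    """Return `wanted` as a list, raising if a name is not available."""
    available = set(con.list_available_collections())
    unknown = sorted(set(wanted) - available)
    if unknown:
        raise ValueError(f"Unknown collections: {unknown}")
    return list(wanted)


# Possible usage with an open connection `con` and a Filter `f`:
#     subset = checked_subset(con, ['spain'])
#     con.count_entries(f, collections_subset=subset)
```

Failing early on a misspelled name avoids running a query against the wrong set of collections.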

class querier.CollectionsAccessor(connection, collections_subset=None)[source]

Provides access to one or several collections of a database for data retrieval.

An instance of this class is normally obtained through a Connection instance, selecting with square brackets the collection(s), as one would select items from a dictionary.

Parameters:

connection (Connection) – Database connection used to retrieve data.
collections_subset (optional) – Collections of the database from which to retrieve the data.

Examples

To extract a single document from the collections colls of the database dbname:

import querier as qr

colls = ["collection A", "collection B"]
with qr.Connection(dbname) as con:
    result = con[colls].extract_one()

aggregate(pipeline, **aggregate_kwargs)[source]

Extract entries from the database resulting from a processing pipeline.

To limit the number of entries that will be returned, use Result.limit(). As databases can contain a huge number of entries, it is advised to test the code with a limited result first.

To iterate through the entries, see querier.Result.

Parameters:

pipeline (list) – List of aggregation pipeline stages.
**aggregate_kwargs – additional keyword arguments to pass on to pymongo.collection.Collection.aggregate().

Return type:

Result

Returns:

Result for iterating over the matching entries in the database, or None if no entry matches.

count_entries(filter=None)[source]

Count the entries that match a filter.

Parameters:

filter (Filter | None) – Filter object to count entries or None (all entries).

Return type:

int

Returns:

The number of entries matching the filter in the selected collections.

Examples

Count how many tweets from Spain in 2014 have more than 500 favorites:

>>> import querier as qr
>>> f = qr.Filter()
>>> f.greater_than('favorite_count', 500)
>>> with qr.Connection('twitter_2014') as con:
...     con['spain'].count_entries(f)
3
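The square-bracket access shown above can be combined with Connection.list_available_collections() to get one count per collection. A minimal sketch, with the hypothetical helper name count_per_collection; con stands for an open Connection.

```python
def count_per_collection(con, entry_filter=None):
    """Map each available collection name to its number of matching entries."""
    return {
        # con[name] selects a single collection, as in con['spain'] above.
        name: con[name].count_entries(entry_filter)
        for name in con.list_available_collections()
    }


# Possible usage with an open connection `con` and a Filter `f`:
#     for name, count in count_per_collection(con, f).items():
#         print(name, count)
```

Note that this issues one query per collection, so it may be slow on databases with many collections.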

distinct(field_name, filter=None)[source]

Return a set with all the possible values that the field can take.

Parameters:

field_name (str) – The name of the field to test.
filter (Filter | None) – Filter to test the entries.

Return type:

set

Returns:

Set of distinct values.

Examples

>>> import querier as qr
>>> with qr.Connection('twitter_2020') as con:
...     con.distinct('place.country')
{'Spain', 'France', 'Portugal', 'Germany', ...}

extract(filter=None, fields=None)[source]

Extract entries from the database that match a filter.

To limit the number of entries that will be returned, use Result.limit(). As databases can contain a huge number of entries, it is advised to test the code with a limited result first.

To iterate through the entries, see querier.Result.

Parameters:

filter (Filter | None) – Filter to test the entries.
fields (list | None) – List of selected fields from the original database that the result dictionaries will have. This is useful when only a subset of the fields is needed.

Return type:

Result

Returns:

Result for iterating over the matching entries in the database, or None if no entry matches the filter.

extract_one(filter=None)[source]

Extract an entry from the database that matches a filter.

Parameters:

filter (Filter | None) – Filter used to test the entries or None (all entries).

Return type:

dict | None

Returns:

Entry from the database or None if no entry matches the filter.

groupby(field_name, pre_filter=None, post_filter=None, silence_warning=False, **aggregate_kwargs)[source]

Group by a given field.

Initialize an aggregation pipeline in the collections given by collections_subset (the default None meaning all available collections), in which we filter according to a pre_filter, group by the field(s) field_name, and then filter according to a post_filter. The aggregations done in the groupby stage are specified by a subsequent call to MongoGroupBy.agg().

Parameters:

field_name – Name or list of names of the field(s) by which to group. Note that nested fields using the dot notation will appear in the output with dots (‘.’) replaced by underscores (‘_’).
pre_filter – Filter to apply before the aggregation.
post_filter – Filter to apply after the aggregation.
silence_warning – If True, silence the warning about doing an aggregation on multiple collections.
**aggregate_kwargs – additional keyword arguments to pass on to pymongo.collection.Collection.aggregate().

Returns:

A MongoGroupBy instance enabling aggregation by field_name.

class querier.MongoGroupBy(collections_accessor, pipeline, silence_warning=False, **agg_kwargs)[source]

Enables aggregation by a pre-determined field.

agg(**aggregations)[source]

Perform an aggregation over the grouped-by-field.

Parameters:

**aggregations – Named aggregations modeled on pandas.core.groupby.DataFrameGroupBy.aggregate(), except that each value is a querier.NamedAgg() with keywords field and aggfunc. For reference, see the pandas user guide.

Return type:

Result

Returns:

Result over which to iterate to obtain the output of the aggregation.

Examples

Count the number of tweets by place in “collection” of database “twitter_2020”:

import querier as qr

with qr.Connection("twitter_2020") as con:
    con["collection"].groupby("place.id", allowDiskUse=True).agg(
        name=qr.NamedAgg(field="place.name", aggfunc="first"),
        nr_tweets=qr.NamedAgg(field="id", aggfunc="count"),
    )
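To make the meaning of the "first" and "count" aggregations concrete, here is a plain-Python illustration of what they compute per group on a small in-memory list. It is not the library's implementation and does not reproduce the format of Result; the NamedAgg below is a local stand-in with the same two fields, and the sample tweets are made up.

```python
from collections import namedtuple

# Local stand-in with the same two fields as querier.NamedAgg (illustration only).
NamedAgg = namedtuple("NamedAgg", ["field", "aggfunc"])

# The two aggregation functions used in the example above.
AGG_FUNCS = {"first": lambda values: values[0], "count": len}


def get_nested(entry, dotted_name):
    """Look up a nested field written with the dot notation."""
    value = entry
    for key in dotted_name.split("."):
        value = value[key]
    return value


def toy_groupby_agg(entries, field_name, **aggregations):
    """Group `entries` by `field_name` and apply the named aggregations."""
    groups = {}
    for entry in entries:
        groups.setdefault(get_nested(entry, field_name), []).append(entry)
    return {
        key: {
            out_name: AGG_FUNCS[agg.aggfunc](
                [get_nested(member, agg.field) for member in members]
            )
            for out_name, agg in aggregations.items()
        }
        for key, members in groups.items()
    }


# Made-up sample data.
tweets = [
    {"id": 1, "place": {"id": "a", "name": "Palma"}},
    {"id": 2, "place": {"id": "a", "name": "Palma"}},
    {"id": 3, "place": {"id": "b", "name": "Madrid"}},
]
summary = toy_groupby_agg(
    tweets,
    "place.id",
    name=NamedAgg(field="place.name", aggfunc="first"),
    nr_tweets=NamedAgg(field="id", aggfunc="count"),
)
print(summary)
# {'a': {'name': 'Palma', 'nr_tweets': 2}, 'b': {'name': 'Madrid', 'nr_tweets': 1}}
```

Each keyword passed to the aggregation names an output value, and the NamedAgg says which field and function produce it.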

class querier.NamedAgg(field: str, aggfunc: str)[source]

NamedTuple describing an aggregation to pass on to MongoGroupBy.agg().

Parameters:

field (str) – The name of the field on which to apply the aggregation.
aggfunc (str) – The name of the aggregation function to apply. Most common aggregation functions work (“sum”, “min”, “mean”…). For a full reference see https://docs.mongodb.com/manual/reference/operator/aggregation/group/#accumulator-operator