Connection class
- class querier.Connection(dbnamecfg, credentials_path='~/.credentials.cfg')
Establishes a connection with a database and allows data retrieval.
The first parameter (dbnamecfg) must be a valid database name. The database administrator should provide both the credentials and the database names you are allowed to access.
- Parameters:
Examples
The following snippet shows the simplest way to create a Connection:
import querier as qr

with qr.Connection(dbname) as con:
    ...  # Use con
where dbname is the name of a database.
- aggregate(pipeline, collections_subset=None, **aggregate_kwargs)
Extract entries from the database resulting from a processing pipeline.
To limit the number of entries that will be returned, use Result.limit(). As databases can contain a huge number of entries, it is advised to test the code with a limited result first.
To iterate through the entries, see querier.Result.
- Parameters:
pipeline (list[dict]) – List of aggregation pipeline stages.
collections_subset (list | None) – List of collections to extract from. A subset of Connection.list_available_collections() or None (all collections).
**aggregate_kwargs – Additional keyword arguments to pass on to pymongo.collection.Collection.aggregate().
- Return type:
- Returns:
A Result for iterating over the entries produced by the pipeline, or None if the pipeline yields no entries.
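As a rough illustration of what the pipeline argument can look like, the sketch below builds a list of stage dictionaries from standard MongoDB stages ($match, $group, $sort). The helper name build_favorites_pipeline is hypothetical, and the field names favorite_count and place.country are assumptions about the stored documents, borrowed from the other examples on this page.

```python
def build_favorites_pipeline(min_favorites):
    """Return a list[dict] of MongoDB aggregation stages.

    Hypothetical helper: the field names are assumptions about the
    documents stored in the database, not part of querier.
    """
    return [
        # Keep only entries with more than `min_favorites` favorites.
        {"$match": {"favorite_count": {"$gt": min_favorites}}},
        # Count the remaining entries per country.
        {"$group": {"_id": "$place.country", "nr_tweets": {"$sum": 1}}},
        # Most frequent countries first.
        {"$sort": {"nr_tweets": -1}},
    ]


pipeline = build_favorites_pipeline(500)
# With an open connection `con`, the list would be passed as:
#     result = con.aggregate(pipeline)
```

Building the pipeline in a separate function keeps it easy to inspect before it is run against a large database.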
- count_entries(filter=None, collections_subset=None)
Count the entries that match a filter.
- Parameters:
filter (Filter | None) – Filter object to count entries or None (all entries).
collections_subset (list | None) – List of collections to extract from. A subset of Connection.list_available_collections() or None (all collections).
- Return type:
- Returns:
The number of entries in the selected collections that match the filter.
Examples
Count how many tweets from Spain in 2014 have more than 500 favorites:
>>> import querier as qr
>>> f = qr.Filter()
>>> f.greater_than('favorite_count', 500)
>>> with qr.Connection('twitter_2014') as con:
...     con.count_entries(f, collections_subset=['spain'])
3
- distinct(field_name, filter=None, collections_subset=None)
Return a set with all the possible values that the field can take.
- Parameters:
field_name (str) – The name of the field to test.
filter (Filter | None) – Filter to test the entries.
collections_subset (list | None) – List of collections to extract from. A subset of Connection.list_available_collections() or None (all collections).
- Return type:
- Returns:
Set of distinct values.
Examples
>>> import querier as qr
>>> with qr.Connection('twitter_2020') as con:
...     con.distinct('place.country')
{'Spain', 'France', 'Portugal', 'Germany', ...}
- extract(filter=None, fields=None, collections_subset=None)
Extract entries from the database that match a filter.
To limit the number of entries that will be returned, use Result.limit(). As databases can contain a huge number of entries, it is advised to test the code with a limited result first.
To iterate through the entries, see querier.Result.
- Parameters:
filter (Filter | None) – Filter to test the entries.
fields (list | None) – List of selected fields from the original database that the result dictionaries will have. This is useful when only a subset of the fields is needed.
collections_subset (list | None) – List of collections to extract from. A subset of Connection.list_available_collections() or None (all collections).
- Return type:
- Returns:
A Result for iterating over the matching entries in the database, or None if no entry matches the filter.
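Since extract() is documented to return either an iterable Result or None, calling code has to handle both cases. The sketch below shows one way to preview a few entries safely. The helper preview_entries is hypothetical, and itertools.islice is used only to bound iteration on the client side; it is a stand-in and does not replace Result.limit(), whose exact behaviour is not described here.

```python
from itertools import islice


def preview_entries(con, filter=None, fields=None, n=5):
    """Return at most `n` entries from `con.extract(...)` as a list.

    `con` is assumed to behave like a querier Connection: `extract`
    returns an iterable Result, or None when nothing matches.
    """
    result = con.extract(filter, fields=fields)
    if result is None:
        return []  # no entry matched the filter
    # Stop iterating after n entries (client-side bound only).
    return list(islice(result, n))
```

With an open connection, a call such as preview_entries(con, fields=['id'], n=3) would then return a short list that is convenient for testing code before running it on the full result.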
- extract_one(filter=None, collections_subset=None)
Extract an entry from the database that matches a filter.
- Parameters:
filter (Filter | None) – Filter used to test the entries or None (all entries).
collections_subset (list | None) – List of collections to extract from. A subset of Connection.list_available_collections() or None (all collections).
- Return type:
dict | None
- Returns:
Entry from the database or None if no entry matches the filter.
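Because extract_one() returns dict | None, a small guard avoids a TypeError when nothing matches. The following is a minimal sketch; the helper name get_field is hypothetical and it only looks up top-level keys of the returned dictionary.

```python
def get_field(con, field_name, filter=None, default=None):
    """Return `field_name` from one matching entry, or `default`.

    `con` is assumed to expose `extract_one(filter)` returning a dict
    or None, as described above. Only top-level keys are looked up.
    """
    entry = con.extract_one(filter)
    if entry is None:
        return default  # no entry matched the filter
    return entry.get(field_name, default)
```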
- groupby(field_name, collection, pre_filter=None, post_filter=None, **aggregate_kwargs)
Group by a given field.
Initialize an aggregation pipeline in the collections given by collections_subset (the default None meaning all available collections), in which we filter according to a pre_filter, group by the field field_name, and then filter according to a post_filter. The aggregations done in the groupby stage are specified by a subsequent call to MongoGroupBy.agg().
- Parameters:
field_name – Name or list of names of the field(s) by which to group. Note that nested fields using the dot notation will appear in the output with dots (‘.’) replaced by underscores (‘_’).
collection – Name of the collection to perform the aggregation from.
pre_filter – Filter to apply before the aggregation.
post_filter – Filter to apply after the aggregation.
silence_warning – If True, silence the warning about doing an aggregation on multiple collections.
**aggregate_kwargs – Additional keyword arguments to pass on to pymongo.collection.Collection.aggregate().
- Returns:
A MongoGroupBy instance enabling aggregation by field_name.
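The note on field_name says that nested fields written with dot notation appear in the output with dots replaced by underscores. The helper below is a hypothetical illustration of that renaming, useful for predicting the key to read from each output entry; it is not querier's own code.

```python
def output_field_name(field_name):
    """Mirror the documented renaming: dots become underscores."""
    return field_name.replace(".", "_")


print(output_field_name("place.country"))  # place_country
```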
- list_available_collections()
Return a list of available collections.
MongoDB databases can be split into collections. For example, all tweets from the USA are in the collection ‘northern_america’ in the twitter databases. Contact the database administrator to find out whether and how collections are semantically split.
Extraction methods can be sped up by using a subset of collections.
- Return type:
- Returns:
A list of all the available collection names.
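One way to build a collections_subset argument is to intersect the collections of interest with those the connection reports as available. The sketch below assumes only that list_available_collections() returns a list of names, as stated above; the helper name select_collections is hypothetical.

```python
def select_collections(con, wanted):
    """Return the names in `wanted` that the connection actually offers.

    The order of `wanted` is preserved; unknown names are dropped.
    """
    available = set(con.list_available_collections())
    return [name for name in wanted if name in available]
```

The returned list could then be passed as collections_subset to the extraction methods, for example con.extract(f, collections_subset=subset).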
- class querier.CollectionsAccessor(connection, collections_subset=None)
Provides access to one or several collections of a database for data retrieval.
An instance of this class is normally obtained through a Connection instance, selecting with square brackets the collection(s), as one would select items from a dictionary.
- Parameters:
connection (Connection) – Database connection used to retrieve data.
collections_subset (optional) – Collections of the database from which to retrieve the data.
Examples
To extract a single document from the collections colls of the database dbname:
import querier as qr

colls = ["collection A", "collection B"]
with qr.Connection(dbname) as con:
    result = con[colls].extract_one()
- aggregate(pipeline, **aggregate_kwargs)
Extract entries from the database resulting from a processing pipeline.
To limit the number of entries that will be returned, use Result.limit(). As databases can contain a huge number of entries, it is advised to test the code with a limited result first.
To iterate through the entries, see querier.Result.
- Parameters:
pipeline (list) – List of aggregation pipeline stages.
**aggregate_kwargs – Additional keyword arguments to pass on to pymongo.collection.Collection.aggregate().
- Return type:
- Returns:
A Result for iterating over the entries produced by the pipeline, or None if the pipeline yields no entries.
- count_entries(filter=None)
Count the entries that match a filter.
- Parameters:
filter (Filter | None) – Filter object to count entries or None (all entries).
- Return type:
- Returns:
The number of entries in the selected collections that match the filter.
Examples
Count how many tweets from Spain in 2014 have more than 500 favorites:
>>> import querier as qr
>>> f = qr.Filter()
>>> f.greater_than('favorite_count', 500)
>>> with qr.Connection('twitter_2014') as con:
...     con['spain'].count_entries(f)
3
- distinct(field_name, filter=None)
Return a set with all the possible values that the field can take.
- Parameters:
- Return type:
- Returns:
Set of distinct values.
Examples
>>> import querier as qr
>>> with qr.Connection('twitter_2020') as con:
...     con.distinct('place.country')
{'Spain', 'France', 'Portugal', 'Germany', ...}
- extract(filter=None, fields=None)
Extract entries from the database that match a filter.
To limit the number of entries that will be returned, use Result.limit(). As databases can contain a huge number of entries, it is advised to test the code with a limited result first.
To iterate through the entries, see querier.Result.
- Parameters:
- Return type:
- Returns:
A Result for iterating over the matching entries in the database, or None if no entry matches the filter.
- groupby(field_name, pre_filter=None, post_filter=None, silence_warning=False, **aggregate_kwargs)
Group by a given field.
Initialize an aggregation pipeline in the collections given by collections_subset (the default None meaning all available collections), in which we filter according to a pre_filter, group by the field(s) field_name, and then filter according to a post_filter. The aggregations done in the groupby stage are specified by a subsequent call to MongoGroupBy.agg().
- Parameters:
field_name – Name or list of names of the field(s) by which to group. Note that nested fields using the dot notation will appear in the output with dots (‘.’) replaced by underscores (‘_’).
pre_filter – Filter to apply before the aggregation.
post_filter – Filter to apply after the aggregation.
silence_warning – If True, silence the warning about doing an aggregation on multiple collections.
**aggregate_kwargs – Additional keyword arguments to pass on to pymongo.collection.Collection.aggregate().
- Returns:
A MongoGroupBy instance enabling aggregation by field_name.
- class querier.MongoGroupBy(collections_accessor, pipeline, silence_warning=False, **agg_kwargs)
Enables aggregation by a pre-determined field.
- agg(**aggregations)
Perform an aggregation over the grouped-by-field.
- Parameters:
**aggregations – Works on the model of named aggregations of pandas.core.groupby.DataFrameGroupBy.aggregate(), except we provide a querier.NamedAgg() with keywords field and aggfunc. For reference see pandas’ user guide.
- Return type:
- Returns:
Result over which to iterate to obtain the output of the aggregation.
Examples
Count the number of tweets by place in “collection” of database “twitter_2020”:
import querier as qr

with qr.Connection("twitter_2020") as con:
    con["collection"].groupby("place.id", allowDiskUse=True).agg(
        name=qr.NamedAgg(field="place.name", aggfunc="first"),
        nr_tweets=qr.NamedAgg(field="id", aggfunc="count"),
    )
- class querier.NamedAgg(field: str, aggfunc: str)
NamedTuple describing an aggregation to pass on to MongoGroupBy.agg().
- Parameters:
field (str) – The name of the field on which to apply the aggregation.
aggfunc (str) – The name of the aggregation function to apply. Most common aggregation functions work (“sum”, “min”, “mean”…). For a full reference see https://docs.mongodb.com/manual/reference/operator/aggregation/group/#accumulator-operator
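To give a feel for how named aggregations relate to MongoDB accumulators, the sketch below turns (field, aggfunc) pairs into a $group stage. It is an illustration only: the NamedAgg class here is a local stand-in with the same two fields, to_group_stage is a hypothetical helper, and the mapping (including expressing "count" as a sum of 1) is an assumption, not a description of how querier translates aggregations internally.

```python
from typing import NamedTuple


class NamedAgg(NamedTuple):
    """Local stand-in with the same two fields as querier.NamedAgg."""
    field: str
    aggfunc: str


def to_group_stage(group_field, **aggregations):
    """Build a MongoDB $group stage from named aggregations.

    Illustration only; querier's real translation may differ.
    """
    stage = {"_id": f"${group_field}"}
    for out_name, agg in aggregations.items():
        if agg.aggfunc == "count":
            # Counting is commonly expressed as summing 1 per document.
            stage[out_name] = {"$sum": 1}
        else:
            # e.g. aggfunc="first" becomes {"$first": "$place.name"}
            stage[out_name] = {f"${agg.aggfunc}": f"${agg.field}"}
    return {"$group": stage}


stage = to_group_stage(
    "place.id",
    name=NamedAgg(field="place.name", aggfunc="first"),
    nr_tweets=NamedAgg(field="id", aggfunc="count"),
)
```

In this sketch only aggfunc names that coincide with a MongoDB accumulator name (such as "first", "sum", "min", "max") produce a valid stage; other names would need an explicit mapping.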