GEOparse package
Submodules
GEOparse.GEOTypes module
Classes that represent different GEO entities
class GEOparse.GEOTypes.BaseGEO(name, metadata)
Bases: object
Initialize base GEO object.
Parameters:
- name (str) – Name of the object.
- metadata (dict) – Metadata information.
Raises: TypeError – Metadata should be a dict.
geotype = None
get_accession()
Return accession ID of the sample.
Returns: GEO accession ID.
Return type: str
get_metadata_attribute(metaname)
Get the metadata attribute by name.
Parameters: metaname (str) – Name of the attribute.
Returns: Value(s) of the requested metadata attribute.
Return type: list or str
Raises:
- NoMetadataException – Attribute error
- TypeError – Metadata should be a list
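The "list or str" return type of get_metadata_attribute can be illustrated with a small stand-alone sketch. This is not the library's code: it assumes that a one-element list is unwrapped to its single value, and it uses KeyError as a stand-in for NoMetadataException. The metadata keys and values are made up for the example.

```python
def get_metadata_attribute(metadata, metaname):
    """Hypothetical stand-alone version of the lookup described above."""
    try:
        value = metadata[metaname]
    except KeyError:
        # Stand-in for NoMetadataException in this sketch.
        raise KeyError("No metadata attribute named %s" % metaname) from None
    if not isinstance(value, list):
        raise TypeError("Metadata value should be a list")
    # Assumption: a single-element list is returned as a bare value.
    return value[0] if len(value) == 1 else value


# Made-up metadata in the shape of a dict of lists.
metadata = {
    "geo_accession": ["GSM0000001"],
    "characteristics_ch1": ["tissue: brain", "age: 12"],
}
accession = get_metadata_attribute(metadata, "geo_accession")
characteristics = get_metadata_attribute(metadata, "characteristics_ch1")
```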
class GEOparse.GEOTypes.GDS(name, metadata, table, columns, subsets, database=None)
Bases: GEOparse.GEOTypes.SimpleGEO
Class that represents a dataset from the GEO database.
Initialize GDS.
Parameters:
- name (str) – Name of the object.
- metadata (dict) – Metadata information.
- table (pandas.DataFrame) – Table with the data from the SOFT file.
- columns (pandas.DataFrame) – Description of the columns. The number, order, and names of the columns, represented as the index of this DataFrame, have to be the same as table.columns.
- subsets (dict of GEOparse.GDSSubset) – GDSSubset objects from the GDS SOFT file.
- database (GEOparse.GEODatabase, optional) – Database from the SOFT file. Defaults to None.
geotype = 'DATASET'
class GEOparse.GEOTypes.GDSSubset(name, metadata)
Bases: GEOparse.GEOTypes.BaseGEO
Class that represents a subset from a GEO GDS object.
Initialize base GEO object.
Parameters:
- name (str) – Name of the object.
- metadata (dict) – Metadata information.
Raises: TypeError – Metadata should be a dict.
geotype = 'SUBSET'
class GEOparse.GEOTypes.GEODatabase(name, metadata)
Bases: GEOparse.GEOTypes.BaseGEO
Class that represents a database entry from a GEO SOFT file.
Initialize base GEO object.
Parameters:
- name (str) – Name of the object.
- metadata (dict) – Metadata information.
Raises: TypeError – Metadata should be a dict.
geotype = 'DATABASE'
class GEOparse.GEOTypes.GPL(name, metadata, table=None, columns=None, gses=None, gsms=None, database=None)
Bases: GEOparse.GEOTypes.SimpleGEO
Class that represents a platform from the GEO database.
Initialize GPL.
Parameters:
- name (str) – Name of the object.
- metadata (dict) – Metadata information.
- table (pandas.DataFrame, optional) – Table with actual GPL data. Defaults to None.
- columns (pandas.DataFrame, optional) – Table with description of the columns. Defaults to None.
- gses (dict of GEOparse.GSE, optional) – A dictionary of GSE objects. Defaults to None.
- gsms (dict of GEOparse.GSM, optional) – A dictionary of GSM objects. Defaults to None.
- database (GEOparse.GEODatabase, optional) – A database object from the SOFT file associated with the GPL. Defaults to None.
geotype = 'PLATFORM'
class GEOparse.GEOTypes.GSE(name, metadata, gpls=None, gsms=None, database=None)
Bases: GEOparse.GEOTypes.BaseGEO
Class representing a GEO series.
Initialize GSE.
Parameters:
- name (str) – Name of the object.
- metadata (dict) – Metadata information.
- gpls (dict of GEOparse.GPL, optional) – A dictionary of GPL objects. Defaults to None.
- gsms (dict of GEOparse.GSM, optional) – A dictionary of GSM objects. Defaults to None.
- database (GEOparse.GEODatabase, optional) – Database from the SOFT file. Defaults to None.
download_SRA(email, directory='series', filterby=None, nproc=1, **kwargs)
Download SRA files for each GSM in series.
Warning
Do not use parallel option (nproc > 1) in the interactive shell. For more details see this issue on SO.
Parameters:
- email (str) – E-mail that will be provided to Entrez.
- directory (str, optional) – Directory to save the data. The default, 'series', saves the data to a directory named after the series with an '_SRA' ending.
- filterby (callable, optional) – Function used to filter GSM objects. It takes a GSM object and returns a bool, e.g. lambda x: "brain" not in x.name. Defaults to None.
- nproc (int, optional) – Number of processes for SRA download. Defaults to 1 (no parallelization).
- **kwargs – Any arbitrary argument passed to the GSM.download_SRA method. See the documentation for more details.
Returns: A dictionary containing the output of the GSM.download_SRA method, where each GSM accession ID is the key for the output.
Return type: dict
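The filterby argument is easiest to read as a predicate over samples. The sketch below shows that idea without touching the network; FakeGSM and select_gsms are hypothetical names invented for the illustration, and it assumes that samples for which the function returns True are the ones kept.

```python
from collections import namedtuple

# Hypothetical stand-in for a GSM object; only the name attribute is used.
FakeGSM = namedtuple("FakeGSM", ["name"])

gsms = {
    "GSM1": FakeGSM("brain_rep1"),
    "GSM2": FakeGSM("liver_rep1"),
    "GSM3": FakeGSM("brain_rep2"),
}


def select_gsms(gsms, filterby=None):
    """Keep the samples for which filterby returns True (all if None)."""
    if filterby is None:
        return dict(gsms)
    return {acc: gsm for acc, gsm in gsms.items() if filterby(gsm)}


# Same predicate as in the parameter description: drop the brain samples.
kept = select_gsms(gsms, filterby=lambda x: "brain" not in x.name)
```

With a real series the same lambda would be passed as filterby to download_SRA.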
download_supplementary_files(directory='series', download_sra=True, email=None, sra_kwargs=None, nproc=1)
Download supplementary data.
Warning
Do not use parallel option (nproc > 1) in the interactive shell. For more details see this issue on SO.
Parameters:
- directory (str, optional) – Directory to download the data to. The function creates a new directory with the files inside it. By default this is named after the series with a '_Supp' ending.
- download_sra (bool, optional) – Indicates whether to download SRA raw data too. Defaults to True.
- email (str, optional) – E-mail that will be provided to Entrez. Defaults to None.
- sra_kwargs (dict, optional) – Kwargs passed to the GSM.download_SRA method. Defaults to None.
- nproc (int, optional) – Number of processes for SRA download. Defaults to 1 (no parallelization).
Returns: Downloaded data for each of the GSMs.
Return type: dict
geotype = 'SERIES'
merge_and_average(platform, expression_column, group_by_column, force=False, merge_on_column=None, gsm_on=None, gpl_on=None)
Merge and average GSE samples.
For the given platform, prepare a DataFrame with all the samples present in the GSE, annotated with the given column from the platform and averaged over that column.
Parameters:
- platform (str or GEOparse.GPL) – GPL platform to use.
- expression_column (str) – Column name in which "expressions" are represented.
- group_by_column (str) – The data will be grouped and averaged over this column and only this column will be kept.
- force (bool) – If the name of the GPL does not match the platform name in GSM, proceed anyway.
- merge_on_column (str) – Column to merge the data on. It should be present in both GSM and GPL.
- gsm_on (str) – If the columns to merge on are different in GSM and GPL, use this column in GSM.
- gpl_on (str) – If the columns to merge on are different in GSM and GPL, use this column in GPL.
Returns: Merged and averaged table of results.
Return type: pandas.DataFrame
phenotype_data
Get the phenotype data for each of the samples.
pivot_and_annotate(values, gpl, annotation_column, gpl_on='ID', gsm_on='ID_REF')
Annotate GSM with provided GPL.
Parameters:
- values (str) – Column to use as values, e.g. "VALUES".
- gpl (pandas.DataFrame or GEOparse.GPL) – A Platform or DataFrame to annotate with.
- annotation_column (str) – Column in table for annotation.
- gpl_on (str, optional) – Use this column in GPL to merge. Defaults to "ID".
- gsm_on (str, optional) – Use this column in GSM to merge. Defaults to "ID_REF".
Returns: Pivoted and annotated table of results.
Return type: pandas.DataFrame
pivot_samples(values, index='ID_REF')
Pivot samples by specified column.
Construct a table in which the columns (names) are the samples, the index is a specified column, e.g. ID_REF, and the values in the columns are of one specified type.
Parameters:
- values (str) – Column name present in all GSMs.
- index (str, optional) – Column name that will become the index of the pivoted table. Defaults to "ID_REF".
Returns: Pivoted data.
Return type: pandas.DataFrame
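The layout that pivot_samples describes (samples as columns, ID_REF as index, one value column) can be sketched with plain dictionaries. This is only an illustration of the shape of the result, not the library's implementation, which returns a pandas.DataFrame; the function below is a hypothetical stand-in that happens to share the method's name, and the probe IDs and values are made up.

```python
def pivot_samples(samples, values, index="ID_REF"):
    """Pure-Python stand-in: samples maps a sample name to a list of row dicts.

    Returns {index value: {sample name: value}}, i.e. one "row" per index
    value and one "column" per sample.
    """
    pivoted = {}
    for sample_name, rows in samples.items():
        for row in rows:
            pivoted.setdefault(row[index], {})[sample_name] = row[values]
    return pivoted


# Made-up tables for two samples measured on the same two probes.
samples = {
    "GSM1": [{"ID_REF": "probe_a", "VALUE": 1.0},
             {"ID_REF": "probe_b", "VALUE": 2.0}],
    "GSM2": [{"ID_REF": "probe_a", "VALUE": 3.0},
             {"ID_REF": "probe_b", "VALUE": 4.0}],
}
table = pivot_samples(samples, values="VALUE")
```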
class GEOparse.GEOTypes.GSM(name, metadata, table, columns)
Bases: GEOparse.GEOTypes.SimpleGEO
Class that represents a sample from the GEO database.
Initialize simple GEO object.
Parameters:
- name (str) – Name of the object.
- metadata (dict) – Metadata information.
- table (pandas.DataFrame) – Table with the data from the SOFT file.
- columns (pandas.DataFrame) – Description of the columns. The number, order, and names of the columns, represented as the index of this DataFrame, have to be the same as table.columns.
Raises:
- ValueError – Table should be a DataFrame
- ValueError – Columns' description should be a DataFrame
- DataIncompatibilityException – Columns are wrong
- ValueError – Description has to be present in columns
annotate(gpl, annotation_column, gpl_on='ID', gsm_on='ID_REF', in_place=False)
Annotate GSM with provided GPL.
Parameters:
- gpl (pandas.DataFrame or GEOparse.GPL) – A Platform or DataFrame to annotate with.
- annotation_column (str) – Column in a table for annotation.
- gpl_on (str) – Use this column in GPL to merge. Defaults to "ID".
- gsm_on (str) – Use this column in GSM to merge. Defaults to "ID_REF".
- in_place (bool) – Substitute table in GSM by new annotated table. Defaults to False.
Returns: Annotated table or None.
Return type: pandas.DataFrame or None
Raises: TypeError – GPL should be GPL or pandas.DataFrame
annotate_and_average(gpl, expression_column, group_by_column, rename=True, force=False, merge_on_column=None, gsm_on=None, gpl_on=None)
Annotate GSM table with provided GPL.
Parameters:
- gpl (GEOTypes.GPL) – Platform for annotations.
- expression_column (str) – Column name in which "expressions" are represented.
- group_by_column (str) – The data will be grouped and averaged over this column and only this column will be kept.
- rename (bool) – Rename output column to self.name. Defaults to True.
- force (bool) – If the name of the GPL does not match the platform name in GSM, proceed anyway. Defaults to False.
- merge_on_column (str) – Column to merge the data on. Defaults to None.
- gsm_on (str) – If the columns to merge on are different in GSM and GPL, use this column in GSM. Defaults to None.
- gpl_on (str) – If the columns to merge on are different in GSM and GPL, use this column in GPL. Defaults to None.
Returns: Annotated data.
Return type: pandas.DataFrame
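The annotate-then-average step can be pictured as a join followed by a group-by. The sketch below uses lists of dicts in place of DataFrames and is not the library's implementation; the function is a hypothetical stand-in that borrows the method's name. It assumes that probes without an annotation are dropped, and the column names GENE and VALUE and the rows are made up.

```python
from collections import defaultdict
from statistics import mean


def annotate_and_average(gsm_rows, gpl_rows, expression_column,
                         group_by_column, gsm_on="ID_REF", gpl_on="ID"):
    """Join sample rows to platform rows, then average per annotation."""
    # Map each platform ID to its annotation (here, a gene symbol).
    annotation = {row[gpl_on]: row[group_by_column] for row in gpl_rows}
    grouped = defaultdict(list)
    for row in gsm_rows:
        key = annotation.get(row[gsm_on])
        if key is not None:  # assumption: unannotated probes are skipped
            grouped[key].append(row[expression_column])
    return {key: mean(vals) for key, vals in grouped.items()}


# Made-up platform and sample tables: two probes map to the same gene.
gpl_rows = [
    {"ID": "p1", "GENE": "TP53"},
    {"ID": "p2", "GENE": "TP53"},
    {"ID": "p3", "GENE": "BRCA1"},
]
gsm_rows = [
    {"ID_REF": "p1", "VALUE": 2.0},
    {"ID_REF": "p2", "VALUE": 4.0},
    {"ID_REF": "p3", "VALUE": 5.0},
]
averaged = annotate_and_average(gsm_rows, gpl_rows, "VALUE", "GENE")
```

Only the grouping column survives in the result, which matches the description of group_by_column above.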
download_SRA(email, directory='./', **kwargs)
Download RAW data as SRA file.
The files will be downloaded to the sample directory created ad hoc or the directory specified by the parameter. The sample has to come from sequencing, e.g. mRNA-seq, CLIP etc.
An important parameter is filetype. By default an SRA is accessed by FTP and such a file is downloaded. This does not require additional libraries. However, in order to produce FASTA or FASTQ files one would need to use SRA-Toolkit. Thus, it is assumed that this library is already installed or it will be installed in the near future. One can immediately specify the download type to fasta or fastq.
To see all possible **kwargs that could be passed to the function see the description of SRADownloader.
Parameters:
- email (str) – An e-mail address (any); required by NCBI for access.
- directory (str, optional) – The directory to which to download the data. Defaults to "./".
- **kwargs – Arbitrary keyword arguments, see description.
Returns: A dictionary containing only one key (SRA) with the list of downloaded files.
Return type: dict
Raises:
- TypeError – Type to download unknown
- NoSRARelationException – No SRAToolkit
- Exception – Wrong e-mail
- HTTPError – Cannot access or connect to DB
download_supplementary_files(directory='./', download_sra=True, email=None, sra_kwargs=None)
Download all supplementary data available for the sample.
Parameters:
- directory (str) – Directory to download the data to. The function creates a new directory with the files inside it. Defaults to "./".
- download_sra (bool) – Indicates whether to download SRA raw data too. Defaults to True.
- email (str) – E-mail that will be provided to Entrez. It is mandatory if download_sra=True. Defaults to None.
- sra_kwargs (dict, optional) – Kwargs passed to the download_SRA method. Defaults to None.
Returns: A key-value pair of the name taken from the metadata and the paths downloaded. In the case of SRA files the key is SRA.
Return type: dict
geotype = 'SAMPLE'
class GEOparse.GEOTypes.SimpleGEO(name, metadata, table, columns)
Bases: GEOparse.GEOTypes.BaseGEO
Initialize simple GEO object.
Parameters:
- name (str) – Name of the object.
- metadata (dict) – Metadata information.
- table (pandas.DataFrame) – Table with the data from the SOFT file.
- columns (pandas.DataFrame) – Description of the columns. The number, order, and names of the columns, represented as the index of this DataFrame, have to be the same as table.columns.
Raises:
- ValueError – Table should be a DataFrame
- ValueError – Columns' description should be a DataFrame
- DataIncompatibilityException – Columns are wrong
- ValueError – Description has to be present in columns
GEOparse.GEOparse module
exception GEOparse.GEOparse.NoEntriesException
Bases: exceptions.Exception
Raised when no entries could be found in the SOFT file.
exception GEOparse.GEOparse.UnknownGEOTypeException
Bases: exceptions.Exception
Raised when the GEO type does not correspond to any known type.
GEOparse.GEOparse.get_GEO(geo=None, filepath=None, destdir='./', how='full', annotate_gpl=False, geotype=None, include_data=False, silent=False, aspera=False, partial=None)
Get the GEO entry.
The GEO entry is taken directly from the GEO database or read from a SOFT file.
Parameters:
- geo (str) – GEO database identifier. Defaults to None.
- filepath (str) – Path to local SOFT file. Defaults to None.
- destdir (str, optional) – Directory to download data. Defaults to "./".
- how (str, optional) – GSM download mode. Defaults to "full".
- annotate_gpl (bool, optional) – Download the GPL annotation instead of regular GPL. If not available, fall back to regular GPL file. Defaults to False.
- geotype (str, optional) – Type of GEO entry. By default it is inferred from the ID or the file name.
- include_data (bool, optional) – Full download of GPLs including series and samples. Defaults to False.
- silent (bool, optional) – Do not print anything. Defaults to False.
- aspera (bool, optional) – EXPERIMENTAL Download using Aspera Connect. Follow Aspera instructions for further details. Defaults to False.
- partial (iterable, optional) – A list of accession IDs of GSMs to be partially extracted from GPL. Works only if the file/accession is a GPL. Defaults to None.
Returns: A GEO object of given type.
Return type: GEOparse.BaseGEO
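A typical call passes either an accession or a local file path. The sketch below wraps the call in a function so that nothing is downloaded on import. The helper names infer_geotype and load_entry are hypothetical, and the prefix-to-type mapping only pairs accession prefixes with the geotype attributes of the classes listed in GEOparse.GEOTypes; which exact strings get_GEO expects for its own geotype argument is not stated here and is not assumed.

```python
# geotype attribute values as listed for the classes in GEOparse.GEOTypes.
GEOTYPE_BY_PREFIX = {
    "GSE": "SERIES",
    "GSM": "SAMPLE",
    "GPL": "PLATFORM",
    "GDS": "DATASET",
}


def infer_geotype(accession):
    """Hypothetical helper: guess the entry type from the accession prefix."""
    prefix = accession[:3].upper()
    try:
        return GEOTYPE_BY_PREFIX[prefix]
    except KeyError:
        raise ValueError("Unknown GEO accession prefix: %r" % prefix) from None


def load_entry(accession, destdir="./"):
    """Fetch an entry with get_GEO (needs GEOparse and network access)."""
    import GEOparse  # imported lazily so infer_geotype works without it
    return GEOparse.get_GEO(geo=accession, destdir=destdir, silent=True)


# No download happens here; only the prefix is inspected.
kind = infer_geotype("GSE1563")
```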
GEOparse.GEOparse.get_GEO_file(geo, destdir=None, annotate_gpl=False, how='full', include_data=False, silent=False, aspera=False)
Download corresponding SOFT file given GEO accession.
Parameters:
- geo (str) – GEO database identifier.
- destdir (str, optional) – Directory to download data. Defaults to None.
- annotate_gpl (bool, optional) – Download the GPL annotation instead of regular GPL. If not available, fall back to regular GPL file. Defaults to False.
- how (str, optional) – GSM download mode. Defaults to "full".
- include_data (bool, optional) – Full download of GPLs including series and samples. Defaults to False.
- silent (bool, optional) – Do not print anything. Defaults to False.
- aspera (bool, optional) – EXPERIMENTAL Download using Aspera Connect. Follow Aspera instructions for further details. Defaults to False.
Returns: Path to the downloaded file and the type of GEO object.
Return type: 2-tuple of str and str
GEOparse.GEOparse.parse_GDS(filepath)
Parse GDS SOFT file.
Parameters: filepath (str) – Path to GDS SOFT file.
Returns: A GDS object.
Return type: GEOparse.GDS
GEOparse.GEOparse.parse_GDS_columns(lines, subsets)
Parse list of lines with columns description from SOFT file of GDS.
Parameters:
- lines (Iterable) – Iterator over the lines.
- subsets (dict of GEOparse.GDSSubset) – Subsets to use.
Returns: Columns description.
Return type: pandas.DataFrame
GEOparse.GEOparse.parse_GPL(filepath, entry_name=None, partial=None)
Parse GPL entry from SOFT file.
Parameters:
- filepath (str or Iterable) – Path to file with 1 GPL entry or list of lines representing GPL from GSE file.
- entry_name (str, optional) – Name of the entry. By default it is inferred from the data.
- partial (iterable, optional) – A list of accession IDs of GSMs to be partially extracted from GPL. Works only if the file/accession is a GPL.
Returns: A GPL object.
Return type: GEOparse.GPL
GEOparse.GEOparse.parse_GSE(filepath)
Parse GSE SOFT file.
Parameters: filepath (str) – Path to GSE SOFT file.
Returns: A GSE object.
Return type: GEOparse.GSE
GEOparse.GEOparse.parse_GSM(filepath, entry_name=None)
Parse GSM entry from SOFT file.
Parameters:
- filepath (str or Iterable) – Path to file with 1 GSM entry or list of lines representing GSM from GSE file.
- entry_name (str, optional) – Name of the entry. By default it is inferred from the data.
Returns: A GSM object.
Return type: GEOparse.GSM
GEOparse.GEOparse.parse_columns(lines)
Parse list of lines with columns description from SOFT file.
Parameters: lines (Iterable) – Iterator over the lines.
Returns: Columns description.
Return type: pandas.DataFrame
GEOparse.GEOparse.parse_entry_name(nameline)
Parse line that starts with ^ and assign the name to it.
Parameters: nameline (str) – A line to process.
Returns: Entry name.
Return type: str
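In SOFT files an entry line has the form ^SAMPLE = GSM12345. A minimal sketch of extracting the name from such a line follows. It is not the library's code: it assumes that the name is everything after the first "=", and the accession in the example is made up.

```python
def parse_entry_name(nameline):
    """Sketch: '^SAMPLE = GSM12345' -> 'GSM12345'."""
    if not nameline.startswith("^"):
        raise ValueError("Entry line should start with '^'")
    # Assumption: everything after the first '=' is the entry name.
    _, _, name = nameline.partition("=")
    return name.strip()


name = parse_entry_name("^SAMPLE = GSM12345\n")
```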
GEOparse.logger module
GEOparse.utils module
GEOparse.utils.download_from_url(url, destination_path, force=False, aspera=False, silent=False)
Download file from remote server.
If the file is already downloaded and the force flag is on, the existing file will be removed.
Parameters:
- url (str) – Path to the file on the remote server (including file name).
- destination_path (str) – Path to the file on the local machine (including file name).
- force (bool) – If the file exists, force overwriting it. Defaults to False.
- aspera (bool) – Download with Aspera Connect. Defaults to False.
- silent (bool) – Do not print any message. Defaults to False.
GEOparse.utils.mkdir_p(path_to_dir)
Make directory(ies).
This function behaves like mkdir -p.
Parameters: path_to_dir (str) – Path to the directory to make.
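The mkdir -p behaviour (create missing parents, do not fail if the directory already exists) can be reproduced with the standard library on Python 3. This is an equivalent sketch rather than the library's own code, and the directory names are made up.

```python
import os
import tempfile


def mkdir_p(path_to_dir):
    """Create the directory and any missing parents; no error if it exists."""
    os.makedirs(path_to_dir, exist_ok=True)


base = tempfile.mkdtemp()
nested = os.path.join(base, "series", "GSE0000_Supp")
mkdir_p(nested)
mkdir_p(nested)  # second call is a no-op, as with mkdir -p
created = os.path.isdir(nested)
```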
GEOparse.utils.smart_open(*args, **kwds)
Open file intelligently depending on the source and python version.
Parameters: filepath (str) – Path to the file.
Yields: Context manager for file handle.
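One way to read "depending on the source" is that compressed and plain SOFT files are opened differently. The sketch below is written under that assumption only: it treats a .gz suffix as a gzip file and everything else as plain text, and it ignores the Python-version handling mentioned above. The file name and contents are made up.

```python
import gzip
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def smart_open(filepath):
    """Yield a text-mode handle for a plain or gzip-compressed file."""
    # Assumption: a .gz suffix marks a gzip-compressed SOFT file.
    opener = gzip.open if filepath.endswith(".gz") else open
    handle = opener(filepath, "rt")
    try:
        yield handle
    finally:
        handle.close()


# Write a tiny compressed file, then read it back through the helper.
path = os.path.join(tempfile.mkdtemp(), "example_family.soft.gz")
with gzip.open(path, "wt") as out:
    out.write("^SERIES = GSE0000\n")

with smart_open(path) as soft:
    first_line = soft.readline()
```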
GEOparse.utils.which(program)
Check if executable exists.
The code is taken from: https://stackoverflow.com/questions/377017/test-if-executable-exists-in-python
Parameters: program (str) – Path to the executable.
Returns: Path to the program or None.
Return type: str or None
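On Python 3.3 and later the same check is available in the standard library as shutil.which. The sketch below shows an equivalent wrapper, not the library's code; fastq-dump is used as an example because it is the SRA-Toolkit executable, and whether it is found depends on the machine.

```python
import shutil


def which(program):
    """Return the full path to program, or None if it cannot be found."""
    return shutil.which(program)


# True only if SRA-Toolkit's fastq-dump is on PATH on this machine.
have_fastq_dump = which("fastq-dump") is not None
missing = which("surely-not-an-installed-program-xyz")
```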