GEOparse package

Submodules

GEOparse.GEOTypes module

Classes that represent different GEO entities

class GEOparse.GEOTypes.BaseGEO(name, metadata)[source]

Bases: object

Initialize base GEO object.

Parameters:
  • name (str) – Name of the object.
  • metadata (dict) – Metadata information.
Raises:

TypeError – Metadata should be a dict.

geotype = None
get_accession()[source]

Return accession ID of the sample.

Returns:GEO accession ID
Return type:str
get_metadata_attribute(metaname)[source]

Get the metadata attribute by the name.

Parameters:

metaname (str) – Name of the attribute

Returns:

Value(s) of the requested metadata attribute

Return type:

list or str

Raises:
get_type()[source]

Get the type of the GEO.

Returns:Type attribute of the GEO
Return type:str
show_metadata()[source]

Print metadata in SOFT format.

to_soft(path_or_handle, as_gzip=False)[source]

Save the object in a SOFT format.

Parameters:
  • path_or_handle (str or file) – Path or handle to output file
  • as_gzip (bool) – Save as gzip
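
The path_or_handle and as_gzip parameters suggest behaviour along the following lines. This is only a minimal sketch under assumptions, not GEOparse's implementation: write_text is a hypothetical helper, and the two SOFT lines are made-up sample content.

```python
import gzip
import io
import os
import tempfile


def write_text(text, path_or_handle, as_gzip=False):
    """Hypothetical helper: write text to a path or to an open text handle."""
    if isinstance(path_or_handle, str):
        # A path: open it ourselves, gzip-compressed if requested.
        opener = gzip.open if as_gzip else open
        with opener(path_or_handle, "wt", encoding="utf-8") as handle:
            handle.write(text)
    else:
        # An already open handle: the caller owns it, so just write.
        path_or_handle.write(text)


# Made-up SOFT content for illustration.
soft_text = "^SAMPLE = GSM0000\n!Sample_title = example\n"

# Write to an in-memory handle.
buffer = io.StringIO()
write_text(soft_text, buffer)

# Write to a gzip-compressed file and read it back.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "GSM0000.soft.gz")
write_text(soft_text, gz_path, as_gzip=True)
with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
    round_trip = handle.read()
```
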
exception GEOparse.GEOTypes.DataIncompatibilityException[source]

Bases: exceptions.Exception

class GEOparse.GEOTypes.GDS(name, metadata, table, columns, subsets, database=None)[source]

Bases: GEOparse.GEOTypes.SimpleGEO

Class that represents a dataset from GEO database

Initialize GDS

Parameters:
  • name (str) – Name of the object.
  • metadata (dict) – Metadata information.
  • table (pandas.DataFrame) – Table with the data from SOFT file.
  • columns (pandas.DataFrame) – Description of the columns. The number of columns, their order, and their names (the index of this DataFrame) have to be the same as in table.columns.
  • subsets (dict of GEOparse.GDSSubset) – GDSSubset from GDS soft file.
  • database (GEOparse.Database, optional) – Database from SOFT file. Defaults to None.
geotype = 'DATASET'
class GEOparse.GEOTypes.GDSSubset(name, metadata)[source]

Bases: GEOparse.GEOTypes.BaseGEO

Class that represents a subset from GEO GDS object.

Initialize base GEO object.

Parameters:
  • name (str) – Name of the object.
  • metadata (dict) – Metadata information.
Raises:

TypeError – Metadata should be a dict.

geotype = 'SUBSET'
class GEOparse.GEOTypes.GEODatabase(name, metadata)[source]

Bases: GEOparse.GEOTypes.BaseGEO

Class that represents a database from GEO.

Initialize base GEO object.

Parameters:
  • name (str) – Name of the object.
  • metadata (dict) – Metadata information.
Raises:

TypeError – Metadata should be a dict.

geotype = 'DATABASE'
class GEOparse.GEOTypes.GPL(name, metadata, table=None, columns=None, gses=None, gsms=None, database=None)[source]

Bases: GEOparse.GEOTypes.SimpleGEO

Class that represents platform from GEO database

Initialize GPL.

Parameters:
  • name (str) – Name of the object
  • metadata (dict) – Metadata information
  • table (pandas.DataFrame, optional) – Table with actual GPL data
  • columns (pandas.DataFrame, optional) – Table with description of the columns. Defaults to None.
  • gses (dict of GEOparse.GSE, optional) – A dictionary of GSE objects. Defaults to None.
  • gsms (dict of GEOparse.GSM, optional) – A dictionary of GSM objects. Defaults to None.
  • database (GEOparse.GEODatabase, optional) – A database object from SOFT file associated with GPL. Defaults to None.
geotype = 'PLATFORM'
class GEOparse.GEOTypes.GSE(name, metadata, gpls=None, gsms=None, database=None)[source]

Bases: GEOparse.GEOTypes.BaseGEO

Class representing GEO series

Initialize GSE.

Parameters:
  • name (str) – Name of the object.
  • metadata (dict) – Metadata information.
  • gpls (dict of GEOparse.GPL, optional) – A dictionary of GPL objects. Defaults to None.
  • gsms (dict of GEOparse.GSM, optional) – A dictionary of GSM objects. Defaults to None.
  • database (GEOparse.Database, optional) – Database from SOFT file. Defaults to None.
download_SRA(email, directory='series', filterby=None, nproc=1, **kwargs)[source]

Download SRA files for each GSM in series.

Warning

Do not use the parallel option (nproc > 1) in the interactive shell. For more details see this issue on Stack Overflow.

Parameters:
  • email (str) – E-mail that will be provided to the Entrez.
  • directory (str, optional) – Directory to save the data. Defaults to “series”, which saves the data to a directory named after the series with an ‘_SRA’ suffix.
  • filterby (callable, optional) – Function used to filter GSM objects; it takes a GSM object and returns a bool, e.g. lambda x: "brain" not in x.name. Defaults to None.
  • nproc (int, optional) – Number of processes for SRA download (default is 1, no parallelization).
  • **kwargs – Any arbitrary argument passed to GSM.download_SRA method. See the documentation for more details.
Returns:

A dictionary containing the output of the GSM.download_SRA method, where each GSM accession ID is the key for the output.

Return type:

dict

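The filterby argument is described as a function that receives a GSM object and returns a bool. A minimal sketch of that kind of filtering follows. The select_samples helper and the sample names are hypothetical, and SimpleNamespace objects stand in for real GSM objects.

```python
from types import SimpleNamespace

# Stand-ins for GSM objects; only the .name attribute matters here.
# The names are made up for illustration.
gsms = {
    "GSM1": SimpleNamespace(name="brain_rep1"),
    "GSM2": SimpleNamespace(name="liver_rep1"),
    "GSM3": SimpleNamespace(name="brain_rep2"),
}


def select_samples(samples, filterby=None):
    """Hypothetical helper: keep samples for which filterby returns True."""
    if filterby is None:
        return dict(samples)
    return {acc: gsm for acc, gsm in samples.items() if filterby(gsm)}


# Same predicate shape as in the parameter description.
kept = select_samples(gsms, filterby=lambda x: "brain" not in x.name)
```
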
download_supplementary_files(directory='series', download_sra=True, email=None, sra_kwargs=None, nproc=1)[source]

Download supplementary data.

Warning

Do not use the parallel option (nproc > 1) in the interactive shell. For more details see this issue on Stack Overflow.

Parameters:
  • directory (str, optional) – Directory to download the data to; the function creates a new directory with the files inside it. By default it is named after the series with a ‘_Supp’ suffix.
  • download_sra (bool, optional) – Indicates whether to download SRA raw data too. Defaults to True.
  • email (str, optional) – E-mail that will be provided to the Entrez. Defaults to None.
  • sra_kwargs (dict, optional) – Kwargs passed to the GSM.download_SRA method. Defaults to None.
  • nproc (int, optional) – Number of processes for SRA download (default is 1, no parallelization).
Returns:

Downloaded data for each of the GSM

Return type:

dict

geotype = 'SERIES'
merge_and_average(platform, expression_column, group_by_column, force=False, merge_on_column=None, gsm_on=None, gpl_on=None)[source]

Merge and average GSE samples.

For given platform prepare the DataFrame with all the samples present in the GSE annotated with given column from platform and averaged over the column.

Parameters:
  • platform (str or GEOparse.GPL) – GPL platform to use.
  • expression_column (str) – Name of the column in which “expressions” are represented.
  • group_by_column (str) – The data will be grouped and averaged over this column and only this column will be kept.
  • force (bool) – If the name of the GPL does not match the platform name in GSM, proceed anyway.
  • merge_on_column (str) – Column to merge the data on; it should be present in both GSM and GPL.
  • gsm_on (str) – If the columns to merge on differ between GSM and GPL, use this column in GSM.
  • gpl_on (str) – If the columns to merge on differ between GSM and GPL, use this column in GPL.
Returns:

Merged and averaged table of results.

Return type:

pandas.DataFrame
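
The "group and average" step can be pictured with plain Python. This is a sketch of the idea only, not the library's pandas-based code; average_by_group, the probe IDs, the gene symbols, and the values are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Made-up platform annotation: probe ID -> gene symbol.
probe_to_gene = {"p1": "TP53", "p2": "TP53", "p3": "BRCA1"}

# Made-up expression values for one sample: probe ID -> value.
sample_values = {"p1": 2.0, "p2": 4.0, "p3": 5.0}


def average_by_group(values, groups):
    """Average the values whose keys map to the same group label."""
    collected = defaultdict(list)
    for key, value in values.items():
        if key in groups:  # skip probes without an annotation
            collected[groups[key]].append(value)
    return {label: mean(vals) for label, vals in collected.items()}


averaged = average_by_group(sample_values, probe_to_gene)
```

Here the two probes annotated with the same symbol collapse into a single averaged entry, which mirrors the role of group_by_column.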

phenotype_data

Get the phenotype data for each of the samples.

pivot_and_annotate(values, gpl, annotation_column, gpl_on='ID', gsm_on='ID_REF')[source]

Annotate GSM with provided GPL.

Parameters:
  • values (str) – Column to use as values, e.g. “VALUES”.
  • gpl (pandas.DataFrame or GEOparse.GPL) – A Platform or DataFrame to annotate with.
  • annotation_column (str) – Column in table for annotation.
  • gpl_on (str, optional) – Use this column in GPL to merge. Defaults to “ID”.
  • gsm_on (str, optional) – Use this column in GSM to merge. Defaults to “ID_REF”.
Returns:

Pivoted and annotated table of results

Return type:

pandas.DataFrame

pivot_samples(values, index='ID_REF')[source]

Pivot samples by specified column.

Construct a table in which the columns (names) are the samples, the index is a specified column, e.g. ID_REF, and the values in the columns are of one specified type.

Parameters:
  • values (str) – Column name present in all GSMs.
  • index (str, optional) – Column name that will become an index in pivoted table. Defaults to “ID_REF”.
Returns:

Pivoted data

Return type:

pandas.DataFrame
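
The shape of the pivoted result can be illustrated without pandas. The sketch below is an assumption-laden stand-in: pivot is a hypothetical helper, the nested dict replaces the DataFrame, and the sample tables are made up.

```python
# Made-up per-sample tables: each row has an ID_REF and a VALUE.
samples = {
    "GSM1": [{"ID_REF": "p1", "VALUE": 1.0}, {"ID_REF": "p2", "VALUE": 2.0}],
    "GSM2": [{"ID_REF": "p1", "VALUE": 3.0}, {"ID_REF": "p2", "VALUE": 4.0}],
}


def pivot(tables, values, index="ID_REF"):
    """Return {index value: {sample name: value}} for the chosen column."""
    pivoted = {}
    for sample_name, rows in tables.items():
        for row in rows:
            pivoted.setdefault(row[index], {})[sample_name] = row[values]
    return pivoted


# Outer keys play the role of the index, inner keys the sample columns.
table = pivot(samples, values="VALUE")
```
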

class GEOparse.GEOTypes.GSM(name, metadata, table, columns)[source]

Bases: GEOparse.GEOTypes.SimpleGEO

Class that represents sample from GEO database.

Initialize simple GEO object.

Parameters:
  • name (str) – Name of the object
  • metadata (dict) – Metadata information
  • table (pandas.DataFrame) – Table with the data from SOFT file
  • columns (pandas.DataFrame) – Description of the columns. The number of columns, their order, and their names (the index of this DataFrame) have to be the same as in table.columns.
Raises:
  • ValueError – Table should be a DataFrame
  • ValueError – Columns’ description should be a DataFrame
  • DataIncompatibilityException – Columns are wrong
  • ValueError – Description has to be present in columns
annotate(gpl, annotation_column, gpl_on='ID', gsm_on='ID_REF', in_place=False)[source]

Annotate GSM with provided GPL

Parameters:
  • gpl (pandas.DataFrame or GEOparse.GPL) – A Platform or DataFrame to annotate with
  • annotation_column (str) – Column in a table for annotation
  • gpl_on (str) – Use this column in GPL to merge. Defaults to “ID”.
  • gsm_on (str) – Use this column in GSM to merge. Defaults to “ID_REF”.
  • in_place (bool) – Substitute table in GSM by new annotated table. Defaults to False.
Returns:

Annotated table or None

Return type:

pandas.DataFrame or None

Raises:

TypeError – GPL should be GPL or pandas.DataFrame
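
The roles of gpl_on, gsm_on, and annotation_column can be sketched as a key lookup between two row collections. This is not the library's code: annotate_rows is hypothetical, the rows are made up, and keeping unmatched sample rows (with None as the annotation) is an assumption of this sketch rather than documented behaviour.

```python
# Made-up sample rows (GSM side) and platform rows (GPL side).
gsm_rows = [
    {"ID_REF": "p1", "VALUE": 1.5},
    {"ID_REF": "p2", "VALUE": 2.5},
    {"ID_REF": "p9", "VALUE": 0.1},  # no matching platform row
]
gpl_rows = [
    {"ID": "p1", "GENE_SYMBOL": "TP53"},
    {"ID": "p2", "GENE_SYMBOL": "BRCA1"},
]


def annotate_rows(gsm, gpl, annotation_column, gpl_on="ID", gsm_on="ID_REF"):
    """Copy each gsm row and add the annotation found via the join keys."""
    lookup = {row[gpl_on]: row[annotation_column] for row in gpl}
    return [
        dict(row, **{annotation_column: lookup.get(row[gsm_on])})
        for row in gsm
    ]


annotated = annotate_rows(gsm_rows, gpl_rows, "GENE_SYMBOL")
```
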

annotate_and_average(gpl, expression_column, group_by_column, rename=True, force=False, merge_on_column=None, gsm_on=None, gpl_on=None)[source]

Annotate GSM table with provided GPL.

Parameters:
  • gpl (GEOTypes.GPL) – Platform for annotations
  • expression_column (str) – Name of the column in which “expressions” are represented.
  • group_by_column (str) – The data will be grouped and averaged over this column and only this column will be kept.
  • rename (bool) – Rename output column to the self.name. Defaults to True.
  • force (bool) – If the name of the GPL does not match the platform name in GSM, proceed anyway. Defaults to False.
  • merge_on_column (str) – Column to merge the data on. Defaults to None.
  • gsm_on (str) – If the columns to merge on differ between GSM and GPL, use this column in GSM. Defaults to None.
  • gpl_on (str) – If the columns to merge on differ between GSM and GPL, use this column in GPL. Defaults to None.
Returns:

Annotated data

Return type:

pandas.DataFrame

download_SRA(email, directory='./', **kwargs)[source]

Download RAW data as SRA file.

The files will be downloaded to a sample directory created ad hoc or to the directory specified by the parameter. The sample has to come from a sequencing experiment, e.g. mRNA-seq, CLIP, etc.

An important parameter is filetype. By default the SRA file is accessed over FTP and downloaded as is, which does not require additional libraries. However, producing FASTA or FASTQ files requires SRA-Toolkit, so it is assumed that this toolkit is already installed or will be installed in the near future. One can also set the download type directly to fasta or fastq.

To see all possible **kwargs that could be passed to the function see the description of SRADownloader.

Parameters:
  • email (str) – An e-mail address (any); required by NCBI for access.
  • directory (str, optional) – The directory to download the data to. Defaults to “./”.
  • **kwargs – Arbitrary keyword arguments, see description
Returns:

A dictionary containing only one key (SRA) with the list of downloaded files.

Return type:

dict

Raises:
  • TypeError – Type to download unknown
  • NoSRARelationException – No SRAToolkit
  • Exception – Wrong e-mail
  • HTTPError – Cannot access or connect to DB
download_supplementary_files(directory='./', download_sra=True, email=None, sra_kwargs=None)[source]

Download all supplementary data available for the sample.

Parameters:
  • directory (str) – Directory to download the data to; the function creates a new directory with the files inside it. Defaults to “./”.
  • download_sra (bool) – Indicates whether to download SRA raw data too. Defaults to True.
  • email (str) – E-mail that will be provided to the Entrez. It is mandatory if download_sra=True. Defaults to None.
  • sra_kwargs (dict, optional) – Kwargs passed to the download_SRA method. Defaults to None.
Returns:

A key-value pair of name taken from the metadata and paths downloaded; in the case of SRA files the key is SRA.

Return type:

dict

geotype = 'SAMPLE'
exception GEOparse.GEOTypes.NoMetadataException[source]

Bases: exceptions.Exception

class GEOparse.GEOTypes.SimpleGEO(name, metadata, table, columns)[source]

Bases: GEOparse.GEOTypes.BaseGEO

Initialize simple GEO object.

Parameters:
  • name (str) – Name of the object
  • metadata (dict) – Metadata information
  • table (pandas.DataFrame) – Table with the data from SOFT file
  • columns (pandas.DataFrame) – Description of the columns. The number of columns, their order, and their names (the index of this DataFrame) have to be the same as in table.columns.
Raises:
  • ValueError – Table should be a DataFrame
  • ValueError – Columns’ description should be a DataFrame
  • DataIncompatibilityException – Columns are wrong
  • ValueError – Description has to be present in columns
head()[source]

Print short description of the object.

show_columns()[source]

Print columns in SOFT format.

show_table(number_of_lines=5)[source]

Show the first few lines of the table as a pandas.DataFrame.

Parameters:number_of_lines (int) – Number of lines to show. Defaults to 5.

GEOparse.GEOparse module

exception GEOparse.GEOparse.NoEntriesException[source]

Bases: exceptions.Exception

Raised when no entries could be found in the SOFT file.

exception GEOparse.GEOparse.UnknownGEOTypeException[source]

Bases: exceptions.Exception

Raised when the GEO type does not correspond to any known type.

GEOparse.GEOparse.get_GEO(geo=None, filepath=None, destdir='./', how='full', annotate_gpl=False, geotype=None, include_data=False, silent=False, aspera=False, partial=None)[source]

Get the GEO entry.

The GEO entry is taken directly from the GEO database or read from a SOFT file.

Parameters:
  • geo (str) – GEO database identifier.
  • filepath (str) – Path to local SOFT file. Defaults to None.
  • destdir (str, optional) – Directory to download data. Defaults to “./”.
  • how (str, optional) – GSM download mode. Defaults to “full”.
  • annotate_gpl (bool, optional) – Download the GPL annotation instead of regular GPL. If not available, fallback to regular GPL file. Defaults to False.
  • geotype (str, optional) – Type of GEO entry. By default it is inferred from the ID or the file name.
  • include_data (bool, optional) – Full download of GPLs including series and samples. Defaults to False.
  • silent (bool, optional) – Do not print anything. Defaults to False.
  • aspera (bool, optional) – EXPERIMENTAL Download using Aspera Connect. Follow Aspera instructions for further details. Defaults to False.
  • partial (iterable, optional) – A list of accession IDs of GSMs to be partially extracted from GPL; works only if a file/accession is a GPL.
Returns:

A GEO object of given type.

Return type:

GEOparse.BaseGEO
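
The geotype parameter is said to be inferred from the ID or the file name by default. One plausible way to do that, shown purely as a sketch, is to look at the three-letter accession prefix. The infer_geotype helper, the accepted prefixes, and the example path are assumptions of this illustration, not a description of GEOparse internals.

```python
import os


def infer_geotype(geo=None, filepath=None):
    """Hypothetical helper: guess the entry type from an ID or a file name."""
    if geo is None and filepath is None:
        raise ValueError("Provide either an accession or a file path.")
    # Prefer the accession; otherwise fall back to the file's base name.
    source = geo if geo is not None else os.path.basename(filepath)
    prefix = source[:3].upper()
    if prefix not in ("GSE", "GSM", "GPL", "GDS"):
        raise ValueError("Unknown GEO type: %s" % prefix)
    return prefix


from_id = infer_geotype(geo="GSE1563")
from_file = infer_geotype(filepath="./data/gpl96.soft.gz")
```
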

GEOparse.GEOparse.get_GEO_file(geo, destdir=None, annotate_gpl=False, how='full', include_data=False, silent=False, aspera=False)[source]

Download corresponding SOFT file given GEO accession.

Parameters:
  • geo (str) – GEO database identifier.
  • destdir (str, optional) – Directory to download data. Defaults to None.
  • annotate_gpl (bool, optional) – Download the GPL annotation instead of regular GPL. If not available, fallback to regular GPL file. Defaults to False.
  • how (str, optional) – GSM download mode. Defaults to “full”.
  • include_data (bool, optional) – Full download of GPLs including series and samples. Defaults to False.
  • silent (bool, optional) – Do not print anything. Defaults to False.
  • aspera (bool, optional) – EXPERIMENTAL Download using Aspera Connect. Follow Aspera instructions for further details. Defaults to False.
Returns:

Path to the downloaded file and the type of GEO object.

Return type:

2-tuple of str and str

GEOparse.GEOparse.parse_GDS(filepath)[source]

Parse GDS SOFT file.

Parameters:filepath (str) – Path to GDS SOFT file.
Returns:A GDS object.
Return type:GEOparse.GDS
GEOparse.GEOparse.parse_GDS_columns(lines, subsets)[source]

Parse list of lines with columns description from SOFT file of GDS.

Parameters:
  • lines (Iterable) – Iterator over the lines.
  • subsets (dict of GEOparse.GDSSubset) – Subsets to use.
Returns:

Columns description.

Return type:

pandas.DataFrame

GEOparse.GEOparse.parse_GPL(filepath, entry_name=None, partial=None)[source]

Parse GPL entry from SOFT file.

Parameters:
  • filepath (str or Iterable) – Path to file with 1 GPL entry or list of lines representing GPL from GSE file.
  • entry_name (str, optional) – Name of the entry. By default it is inferred from the data.
  • partial (iterable, optional) – A list of accession IDs of GSMs to be partially extracted from GPL; works only if a file/accession is a GPL.
Returns:

A GPL object.

Return type:

GEOparse.GPL

GEOparse.GEOparse.parse_GSE(filepath)[source]

Parse GSE SOFT file.

Parameters:filepath (str) – Path to GSE SOFT file.
Returns:A GSE object.
Return type:GEOparse.GSE
GEOparse.GEOparse.parse_GSM(filepath, entry_name=None)[source]

Parse GSM entry from SOFT file.

Parameters:
  • filepath (str or Iterable) – Path to file with 1 GSM entry or list of lines representing GSM from GSE file.
  • entry_name (str, optional) – Name of the entry. By default it is inferred from the data.
Returns:

A GSM object.

Return type:

GEOparse.GSM

GEOparse.GEOparse.parse_columns(lines)[source]

Parse list of lines with columns description from SOFT file.

Parameters:lines (Iterable) – Iterator over the lines.
Returns:Columns description.
Return type:pandas.DataFrame
GEOparse.GEOparse.parse_entry_name(nameline)[source]

Parse line that starts with ^ and assign the name to it.

Parameters:nameline (str) – A line to process.
Returns:Entry name.
Return type:str
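
In SOFT files an entry line has the form ^TYPE = NAME. A minimal sketch of extracting the name from such a line follows; entry_name_from_line is a hypothetical helper written for this illustration and the accession is made up.

```python
def entry_name_from_line(nameline):
    """Hypothetical helper: return the name from a '^TYPE = NAME' line."""
    if not nameline.startswith("^"):
        raise ValueError("Entry lines are expected to start with '^'.")
    # Everything after the first '=' is the entry name.
    _, _, name = nameline.partition("=")
    return name.strip()


entry_name = entry_name_from_line("^SAMPLE = GSM12345\n")
```
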
GEOparse.GEOparse.parse_metadata(lines)[source]

Parse list of lines with metadata information from SOFT file.

Parameters:lines (Iterable) – Iterator over the lines.
Returns:Metadata from SOFT file.
Return type:dict
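
SOFT metadata lines have the form !key = value, and a key may repeat. The sketch below shows one way such lines could be collected into a dict; it is an assumption-based illustration with a hypothetical helper and made-up lines, not the library's parser. Storing every value in a list is consistent with get_metadata_attribute returning "list or str".

```python
def metadata_from_lines(lines):
    """Hypothetical helper: collect '!key = value' lines into a dict of lists."""
    metadata = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("!") or "=" not in line:
            continue  # not a metadata line (e.g. entry or table markers)
        key, _, value = line[1:].partition("=")
        # Repeated keys accumulate, so every value is stored in a list.
        metadata.setdefault(key.strip(), []).append(value.strip())
    return metadata


# Made-up SOFT lines for illustration.
soft_lines = [
    "^SAMPLE = GSM0000",
    "!Sample_title = example sample",
    "!Sample_characteristics_ch1 = tissue: brain",
    "!Sample_characteristics_ch1 = age: 5",
    "!sample_table_begin",
]
parsed = metadata_from_lines(soft_lines)
```
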
GEOparse.GEOparse.parse_table_data(lines)[source]

Parse list of lines from SOFT file into DataFrame.

Parameters:lines (Iterable) – Iterator over the lines.
Returns:Table data.
Return type:pandas.DataFrame
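
SOFT data tables are tab-separated with a header row. As a rough stdlib-only sketch (the real function returns a pandas.DataFrame), the same lines can be read with the csv module. table_from_lines is hypothetical and the rows are made up.

```python
import csv
import io


def table_from_lines(lines):
    """Hypothetical helper: parse tab-separated lines, first line as header."""
    text = "\n".join(line.rstrip("\n") for line in lines)
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Values stay as strings here; no type conversion is attempted.
rows = table_from_lines(["ID_REF\tVALUE", "p1\t1.5", "p2\t2.5"])
```
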

GEOparse.logger module

GEOparse.logger.set_verbosity(level)[source]

Set the log level.

Parameters:level (str) – Level name, e.g. DEBUG or ERROR
GEOparse.logger.add_log_file(path)[source]

Add log file.

Parameters:path (str) – Path to the log file.
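
Both functions map naturally onto the standard logging module. The following sketch shows the equivalent stdlib calls under assumptions: the logger name "geoparse_sketch" and the helper names are invented for this example and say nothing about how GEOparse configures its own logger.

```python
import logging
import os
import tempfile

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("geoparse_sketch")


def set_level(level):
    """Set the level from a name such as 'DEBUG' or 'ERROR'."""
    logger.setLevel(level.upper())


def add_file(path):
    """Attach a file handler so records are also written to path."""
    handler = logging.FileHandler(path)
    logger.addHandler(handler)
    return handler


set_level("debug")
log_path = os.path.join(tempfile.mkdtemp(), "sketch.log")
handler = add_file(log_path)
logger.debug("written to the log file")

# Detach and close the handler so the file is released.
logger.removeHandler(handler)
handler.close()
```
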

GEOparse.utils module

GEOparse.utils.download_from_url(url, destination_path, force=False, aspera=False, silent=False)[source]

Download file from remote server.

If the file is already downloaded and the force flag is on, the existing file will be removed.

Parameters:
  • url (str) – Path to the file on remote server (including file name)
  • destination_path (str) – Path to the file on local machine (including file name)
  • force (bool) – If the file exists, force overwriting it. Defaults to False.
  • aspera (bool) – Download with Aspera Connect. Defaults to False.
  • silent (bool) – Do not print any message. Defaults to False.
GEOparse.utils.mkdir_p(path_to_dir)[source]

Make directory(ies).

This function behaves like mkdir -p.

Parameters:path_to_dir (str) – Path to the directory to make.
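
In Python 3 the mkdir -p behaviour is available directly from the standard library. The sketch below uses os.makedirs with exist_ok=True; make_dirs is a hypothetical stand-in and is not claimed to be how mkdir_p is implemented.

```python
import os
import tempfile


def make_dirs(path_to_dir):
    """Create the directory and any missing parents; no error if it exists."""
    os.makedirs(path_to_dir, exist_ok=True)


base = tempfile.mkdtemp()
nested = os.path.join(base, "a", "b", "c")
make_dirs(nested)
make_dirs(nested)  # second call does nothing and raises no error
```
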
GEOparse.utils.smart_open(*args, **kwds)[source]

Open file intelligently depending on the source and python version.

Parameters:filepath (str) – Path to the file.
Yields:Context manager for file handle.
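
One reading of "depending on the source" is that compressed and plain files are opened transparently. The sketch below assumes exactly that and nothing more: open_text is a hypothetical context manager that picks gzip for paths ending in .gz, and the file content is made up.

```python
import contextlib
import gzip
import os
import tempfile


@contextlib.contextmanager
def open_text(filepath):
    """Hypothetical helper: yield a text handle for plain or .gz files."""
    opener = gzip.open if filepath.endswith(".gz") else open
    handle = opener(filepath, "rt", encoding="utf-8")
    try:
        yield handle
    finally:
        handle.close()


# Create a small gzip-compressed file to read back.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example.soft.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as out:
    out.write("^SAMPLE = GSM0000\n")

with open_text(gz_path) as handle:
    first_line = handle.readline()
```
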
GEOparse.utils.which(program)[source]

Check if executable exists.

The code is taken from: https://stackoverflow.com/questions/377017/test-if-executable-exists-in-python

Parameters:program (str) – Path to the executable.
Returns:Path to the program or None.
Return type:str or None
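
The standard library offers a comparable check through shutil.which, which also returns a path or None. The snippet is a sketch of that stdlib alternative, not of this function's code; find_program and the program name are made up.

```python
import shutil


def find_program(program):
    """Return the full path to program if it is on PATH, else None."""
    return shutil.which(program)


# A name that is very unlikely to exist on any system.
missing = find_program("no-such-program-for-this-sketch")
```
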

Module contents