GEOparse package¶

Submodules¶

GEOparse.GEOTypes module¶

Classes that represent different GEO entities

class GEOparse.GEOTypes.BaseGEO(name, metadata)[source]¶

Bases: object

Initialize base GEO object.

Parameters:	name (`str`) – Name of the object. metadata (`dict`) – Metadata information.
Raises:	TypeError – Metadata should be a dict.

geotype = None¶

get_accession()[source]¶

Return accession ID of the sample.

Returns:	GEO accession ID
Return type:	`str`

get_metadata_attribute(metaname)[source]¶

Get the metadata attribute by the name.

Parameters:	metaname (`str`) – Name of the attribute
Returns:	Value(s) of the requested metadata attribute
Return type:	`list` or `str`
Raises:	NoMetadataException – Attribute error TypeError – Metadata should be a list
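The contract described above (a missing attribute raises, non-list values are rejected, and either one value or several come back) can be illustrated with a small stand-alone sketch. It does not use GEOparse itself: `lookup_metadata` and `NoMetadataError` are hypothetical stand-ins, and the unwrapping of single-element lists is an assumption inferred from the documented return type (`list` or `str`).

```python
class NoMetadataError(Exception):
    """Hypothetical stand-in for GEOparse.GEOTypes.NoMetadataException."""


def lookup_metadata(metadata, metaname):
    """Mimic the documented behaviour of get_metadata_attribute."""
    value = metadata.get(metaname)
    if value is None:
        raise NoMetadataError("No metadata attribute named %s" % metaname)
    if not isinstance(value, list):
        raise TypeError("Metadata values are expected to be lists")
    # Assumption: a one-element list is returned as a plain string.
    return value if len(value) > 1 else value[0]


# Invented metadata in the shape of a parsed SOFT entry.
sample_metadata = {
    "title": ["Liver, replicate 1"],
    "characteristics_ch1": ["tissue: liver", "age: 8 weeks"],
}
title = lookup_metadata(sample_metadata, "title")
traits = lookup_metadata(sample_metadata, "characteristics_ch1")
```

Callers that always want a list can wrap a returned `str` themselves, which avoids branching on the return type further down.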

get_type()[source]¶

Get the type of the GEO.

Returns:	Type attribute of the GEO
Return type:	`str`

show_metadata()[source]¶: Print metadata in SOFT format.

to_soft(path_or_handle, as_gzip=False)[source]¶

Save the object in a SOFT format.

Parameters:	path_or_handle (`str` or `file`) – Path or handle to output file as_gzip (`bool`) – Save as gzip

exception GEOparse.GEOTypes.DataIncompatibilityException[source]¶: Bases: exceptions.Exception

class GEOparse.GEOTypes.GDS(name, metadata, table, columns, subsets, database=None)[source]¶

Bases: GEOparse.GEOTypes.SimpleGEO

Class that represents a dataset from GEO database

Initialize GDS

Parameters:

name (str) – Name of the object.
metadata (dict) – Metadata information.
table (pandas.DataFrame) – Table with the data from SOFT file.
columns (pandas.DataFrame) – Description of the columns. The number, order, and names of the columns, held in the index of this DataFrame, must match table.columns.
subsets (dict of GEOparse.GDSSubset) – GDSSubset objects from the GDS SOFT file.
database (GEOparse.GEODatabase, optional) – Database from SOFT file. Defaults to None.

geotype = 'DATASET'¶

class GEOparse.GEOTypes.GDSSubset(name, metadata)[source]¶

Bases: GEOparse.GEOTypes.BaseGEO

Class that represents a subset from GEO GDS object.

Initialize base GEO object.

Parameters:	name (`str`) – Name of the object. metadata (`dict`) – Metadata information.
Raises:	TypeError – Metadata should be a dict.

geotype = 'SUBSET'¶

class GEOparse.GEOTypes.GEODatabase(name, metadata)[source]¶

Bases: GEOparse.GEOTypes.BaseGEO

Class that represents a database entry from a GEO SOFT file.

Initialize base GEO object.

Parameters:	name (`str`) – Name of the object. metadata (`dict`) – Metadata information.
Raises:	TypeError – Metadata should be a dict.

geotype = 'DATABASE'¶

class GEOparse.GEOTypes.GPL(name, metadata, table=None, columns=None, gses=None, gsms=None, database=None)[source]¶

Bases: GEOparse.GEOTypes.SimpleGEO

Class that represents platform from GEO database

Initialize GPL.

Parameters:

name (str) – Name of the object
metadata (dict) – Metadata information
table (pandas.DataFrame, optional) – Table with actual GPL data
columns (pandas.DataFrame, optional) – Table with description of the columns. Defaults to None.
gses (dict of GEOparse.GSE, optional) – A dictionary of GSE objects. Defaults to None.
gsms (dict of GEOparse.GSM, optional) – A dictionary of GSM objects. Defaults to None.
database (GEOparse.GEODatabase, optional) – A database object from SOFT file associated with GPL. Defaults to None.

geotype = 'PLATFORM'¶

class GEOparse.GEOTypes.GSE(name, metadata, gpls=None, gsms=None, database=None)[source]¶

Bases: GEOparse.GEOTypes.BaseGEO

Class representing GEO series

Initialize GSE.

Parameters:

name (`str`) – Name of the object.
metadata (`dict`) – Metadata information.
gpls (`dict` of `GEOparse.GPL`, optional) – A dictionary of GPL objects. Defaults to None.
gsms (`dict` of `GEOparse.GSM`, optional) – A dictionary of GSM objects. Defaults to None.
database (`GEOparse.GEODatabase`, optional) – Database from SOFT file. Defaults to None.

download_SRA(email, directory='series', filterby=None, nproc=1, **kwargs)[source]¶

Download SRA files for each GSM in series.

Warning

Do not use the parallel option (nproc > 1) in an interactive shell. For more details, see the related Stack Overflow issue.

Parameters:

email (str) – E-mail that will be provided to the Entrez.
directory (str, optional) – Directory to save the data. The default, “series”, saves the data to a directory named after the series with an ‘_SRA’ suffix.
filterby (callable, optional) – Function used to filter GSM objects; it takes a GSM object and returns a bool, e.g. lambda x: "brain" not in x.name. Defaults to None.
nproc (int, optional) – Number of processes for SRA download. Defaults to 1 (no parallelization).
**kwargs – Arbitrary keyword arguments passed to the GSM.download_SRA method. See its documentation for more details.

Returns:

A dictionary containing the output of the GSM.download_SRA method, keyed by GSM accession ID.

Return type:

dict
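As a rough illustration of how a filterby predicate selects samples, the sketch below applies one to a dictionary of stand-in objects. The objects are `SimpleNamespace` instances with an invented `name` attribute, not real GSM objects, and no download takes place.

```python
from types import SimpleNamespace

# Stand-ins for GSM objects; only the `name` attribute is used here,
# and the names themselves are invented for the example.
gsms = {
    "GSM1": SimpleNamespace(name="brain_rep1"),
    "GSM2": SimpleNamespace(name="liver_rep1"),
    "GSM3": SimpleNamespace(name="brain_rep2"),
}


def keep_non_brain(gsm):
    """Predicate in the shape expected by filterby: GSM object in, bool out."""
    return "brain" not in gsm.name


# Keep only the samples for which the predicate returns True.
selected = {acc: gsm for acc, gsm in gsms.items() if keep_non_brain(gsm)}
selected_accessions = sorted(selected)
```

A named function such as `keep_non_brain` is used instead of a lambda because named functions are easier to reuse and test, and they can be pickled if work is handed to several processes.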

download_supplementary_files(directory='series', download_sra=True, email=None, sra_kwargs=None, nproc=1)[source]¶

Download supplementary data.

Warning

Do not use the parallel option (nproc > 1) in an interactive shell. For more details, see the related Stack Overflow issue.

Parameters:

directory (`str`, optional) – Directory to download the data to; the function creates a new directory with the files inside it. By default it is named after the series with a ‘_Supp’ suffix.
download_sra (`bool`, optional) – Indicates whether to download SRA raw data too. Defaults to True.
email (`str`, optional) – E-mail that will be provided to the Entrez. Defaults to None.
sra_kwargs (`dict`, optional) – Kwargs passed to the GSM.download_SRA method. Defaults to None.
nproc (`int`, optional) – Number of processes for SRA download. Defaults to 1 (no parallelization).
Returns:	Downloaded data for each of the GSM
Return type:	`dict`

geotype = 'SERIES'¶

merge_and_average(platform, expression_column, group_by_column, force=False, merge_on_column=None, gsm_on=None, gpl_on=None)[source]¶

Merge and average GSE samples.

For given platform prepare the DataFrame with all the samples present in the GSE annotated with given column from platform and averaged over the column.

Parameters:

platform (`str` or `GEOparse.GPL`) – GPL platform to use.
expression_column (`str`) – Column name in which “expressions” are represented.
group_by_column (`str`) – The data will be grouped and averaged over this column, and only this column will be kept.
force (`bool`) – If the name of the GPL does not match the platform name in GSM, proceed anyway.
merge_on_column (`str`) – Column to merge the data on; it should be present in both GSM and GPL.
gsm_on (`str`) – If the columns to merge on differ between GSM and GPL, use this column in GSM.
gpl_on (`str`) – If the columns to merge on differ between GSM and GPL, use this column in GPL.
Returns:	Merged and averaged table of results.
Return type:	`pandas.DataFrame`

phenotype_data¶: Get the phenotype data for each of the samples.

pivot_and_annotate(values, gpl, annotation_column, gpl_on='ID', gsm_on='ID_REF')[source]¶

Annotate GSM with provided GPL.

Parameters:	values (`str`) – Column to use as values eg. “VALUES” gpl (`pandas.DataFrame` or `GEOparse.GPL`) – A Platform or DataFrame to annotate with. annotation_column (`str`) – Column in table for annotation. gpl_on (`str`, optional) – Use this column in GPL to merge. Defaults to “ID”. gsm_on (`str`, optional) – Use this column in GSM to merge. Defaults to “ID_REF”.
Returns:	Pivoted and annotated table of results
Return type:	pandas.DataFrame

pivot_samples(values, index='ID_REF')[source]¶

Pivot samples by specified column.

Construct a table in which columns (names) are the samples, index is a specified column eg. ID_REF and values in the columns are of one specified type.

Parameters:	values (`str`) – Column name present in all GSMs. index (`str`, optional) – Column name that will become an index in pivoted table. Defaults to “ID_REF”.
Returns:	Pivoted data
Return type:	`pandas.DataFrame`
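The reshaping that pivot_samples describes (samples become columns, the index column becomes the row key) can be sketched without pandas. The helper `pivot` and the probe and sample names below are hypothetical, and the real method returns a `pandas.DataFrame`, not nested dictionaries.

```python
# Each sample is a list of rows, as if read from its SOFT table (invented data).
samples = {
    "GSM1": [{"ID_REF": "probe_a", "VALUE": 1.0}, {"ID_REF": "probe_b", "VALUE": 2.0}],
    "GSM2": [{"ID_REF": "probe_a", "VALUE": 3.0}, {"ID_REF": "probe_b", "VALUE": 4.0}],
}


def pivot(samples, values, index="ID_REF"):
    """Return {index value: {sample name: value}} for one chosen column."""
    pivoted = {}
    for sample_name, rows in samples.items():
        for row in rows:
            pivoted.setdefault(row[index], {})[sample_name] = row[values]
    return pivoted


table = pivot(samples, "VALUE")
```

Each outer key corresponds to one row of the pivoted table and each inner key to one sample column.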

class GEOparse.GEOTypes.GSM(name, metadata, table, columns)[source]¶

Bases: GEOparse.GEOTypes.SimpleGEO

Class that represents sample from GEO database.

Initialize simple GEO object.

Parameters:

name (`str`) – Name of the object.
metadata (`dict`) – Metadata information.
table (`pandas.DataFrame`) – Table with the data from SOFT file.
columns (`pandas.DataFrame`) – Description of the columns. The number, order, and names of the columns, held in the index of this DataFrame, must match table.columns.

Raises:

ValueError – Table should be a DataFrame.
ValueError – Columns’ description should be a DataFrame.
DataIncompatibilityException – Columns are wrong.
ValueError – Description has to be present in columns.

annotate(gpl, annotation_column, gpl_on='ID', gsm_on='ID_REF', in_place=False)[source]¶

Annotate GSM with provided GPL

Parameters:

gpl (`pandas.DataFrame` or `GEOparse.GPL`) – A Platform or DataFrame to annotate with.
annotation_column (`str`) – Column in a table for annotation.
gpl_on (`str`) – Use this column in GPL to merge. Defaults to “ID”.
gsm_on (`str`) – Use this column in GSM to merge. Defaults to “ID_REF”.
in_place (`bool`) – Substitute table in GSM by new annotated table. Defaults to False.
Returns:	Annotated table or None
Return type:	`pandas.DataFrame` or `None`
Raises:	TypeError – GPL should be GPL or pandas.DataFrame

annotate_and_average(gpl, expression_column, group_by_column, rename=True, force=False, merge_on_column=None, gsm_on=None, gpl_on=None)[source]¶

Annotate GSM table with provided GPL.

Parameters:

gpl (`GEOTypes.GPL`) – Platform for annotations.
expression_column (`str`) – Column name in which “expressions” are represented.
group_by_column (`str`) – The data will be grouped and averaged over this column, and only this column will be kept.
rename (`bool`) – Rename output column to the self.name. Defaults to True.
force (`bool`) – If the name of the GPL does not match the platform name in GSM, proceed anyway. Defaults to False.
merge_on_column (`str`) – Column to merge the data on. Defaults to None.
gsm_on (`str`) – If the columns to merge on differ between GSM and GPL, use this column in GSM. Defaults to None.
gpl_on (`str`) – If the columns to merge on differ between GSM and GPL, use this column in GPL. Defaults to None.
Returns:	Annotated data
Return type:	`pandas.DataFrame`
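The annotate-then-average idea can be sketched in plain Python: join sample rows to platform rows on a shared probe identifier, group by an annotation column, and average the expression values. Everything below (the function name, the probe IDs, and the `GENE` column) is invented for illustration; the real method works on `pandas.DataFrame` objects.

```python
from collections import defaultdict
from statistics import mean

# Invented sample (GSM-like) and platform (GPL-like) rows.
gsm_rows = [
    {"ID_REF": "p1", "VALUE": 2.0},
    {"ID_REF": "p2", "VALUE": 4.0},
    {"ID_REF": "p3", "VALUE": 10.0},
]
gpl_rows = [
    {"ID": "p1", "GENE": "TP53"},
    {"ID": "p2", "GENE": "TP53"},
    {"ID": "p3", "GENE": "BRCA1"},
]


def annotate_and_average_rows(gsm_rows, gpl_rows, expression_column,
                              group_by_column, gsm_on="ID_REF", gpl_on="ID"):
    """Join on the probe identifier, then average values per annotation group."""
    annotation = {row[gpl_on]: row[group_by_column] for row in gpl_rows}
    grouped = defaultdict(list)
    for row in gsm_rows:
        group = annotation.get(row[gsm_on])
        if group is not None:  # probes without annotation are dropped
            grouped[group].append(row[expression_column])
    return {group: mean(values) for group, values in grouped.items()}


averaged = annotate_and_average_rows(gsm_rows, gpl_rows, "VALUE", "GENE")
```

Dropping probes that lack an annotation is a choice made for this sketch; whether the library does the same is not stated in this reference.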

download_SRA(email, directory='./', **kwargs)[source]¶

Download RAW data as SRA file.

The files will be downloaded to the sample directory created ad hoc or the directory specified by the parameter. The sample has to come from sequencing, e.g. mRNA-seq, CLIP etc.

An important parameter is filetype. By default the SRA file is accessed over FTP and downloaded as is, which does not require additional libraries. However, producing FASTA or FASTQ files requires SRA-Toolkit, so it is assumed that the toolkit is already installed or will be installed later. The download type can also be set directly to fasta or fastq.

To see all possible **kwargs that could be passed to the function see the description of SRADownloader.

Parameters:	email (`str`) – an email (any) - Required by NCBI for access directory (`str`, optional) – The directory to which download the data. Defaults to “./”. **kwargs – Arbitrary keyword arguments, see description
Returns:	A dictionary containing only one key (`SRA`) with the list of downloaded files.
Return type:	`dict`
Raises:	TypeError – Type to download unknown NoSRARelationException – No SRAToolkit Exception – Wrong e-mail HTTPError – Cannot access or connect to DB

download_supplementary_files(directory='./', download_sra=True, email=None, sra_kwargs=None)[source]¶

Download all supplementary data available for the sample.

Parameters:

directory (str) – Directory to download the data (in this directory function will create new directory with the files). Defaults to “./”.
download_sra (bool) – Indicates whether to download SRA raw data too. Defaults to True.
email (str) – E-mail that will be provided to the Entrez. It is mandatory if download_sra=True. Defaults to None.
sra_kwargs (dict, optional) – Kwargs passed to the download_SRA method. Defaults to None.

Returns:

A dictionary that maps names taken from the metadata to the paths of the downloaded files. For SRA files the key is SRA.

Return type:

dict

geotype = 'SAMPLE'¶

exception GEOparse.GEOTypes.NoMetadataException[source]¶: Bases: exceptions.Exception

class GEOparse.GEOTypes.SimpleGEO(name, metadata, table, columns)[source]¶

Bases: GEOparse.GEOTypes.BaseGEO

Initialize simple GEO object.

Parameters:

name (`str`) – Name of the object.
metadata (`dict`) – Metadata information.
table (`pandas.DataFrame`) – Table with the data from SOFT file.
columns (`pandas.DataFrame`) – Description of the columns. The number, order, and names of the columns, held in the index of this DataFrame, must match table.columns.

Raises:

ValueError – Table should be a DataFrame.
ValueError – Columns’ description should be a DataFrame.
DataIncompatibilityException – Columns are wrong.
ValueError – Description has to be present in columns.

head()[source]¶: Print short description of the object.

show_columns()[source]¶: Print columns in SOFT format.

show_table(number_of_lines=5)[source]¶

Show the first few lines of the table as a pandas.DataFrame.

Parameters:	number_of_lines (`int`) – Number of lines to show. Defaults to 5.

GEOparse.GEOparse module¶

exception GEOparse.GEOparse.NoEntriesException[source]¶

Bases: exceptions.Exception

Raised when no entries could be found in the SOFT file.

exception GEOparse.GEOparse.UnknownGEOTypeException[source]¶

Bases: exceptions.Exception

Raised when the GEO type does not correspond to any known type.

GEOparse.GEOparse.get_GEO(geo=None, filepath=None, destdir='./', how='full', annotate_gpl=False, geotype=None, include_data=False, silent=False, aspera=False, partial=None)[source]¶

Get the GEO entry.

The GEO entry is either downloaded directly from the GEO database or read from a SOFT file.

Parameters:

geo (`str`) – GEO database identifier.
filepath (`str`) – Path to local SOFT file. Defaults to None.
destdir (`str`, optional) – Directory to download data. Defaults to “./”.
how (`str`, optional) – GSM download mode. Defaults to “full”.
annotate_gpl (`bool`, optional) – Download the GPL annotation instead of regular GPL. If not available, fall back to the regular GPL file. Defaults to False.
geotype (`str`, optional) – Type of GEO entry. By default it is inferred from the ID or the file name.
include_data (`bool`, optional) – Full download of GPLs including series and samples. Defaults to False.
silent (`bool`, optional) – Do not print anything. Defaults to False.
aspera (`bool`, optional) – EXPERIMENTAL Download using Aspera Connect. Follow Aspera instructions for further details. Defaults to False.
partial (`iterable`, optional) – A list of accession IDs of GSMs to be partially extracted from GPL; works only if a file/accession is a GPL.
Returns:	A GEO object of given type.
Return type:	`GEOparse.BaseGEO`
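A minimal usage sketch follows, assuming GEOparse is installed and the machine has network access. `summarize_entry`, `fetch_and_summarize`, and `StubEntry` are hypothetical helpers written for this example; the `metadata` attribute is assumed to hold the dictionary passed to the constructor. The GEOparse import sits inside the function so that the summary helper can be tried against a stub without downloading anything.

```python
def summarize_entry(entry):
    """Build a one-line summary from an object exposing the BaseGEO methods."""
    return "%s (%s), %d metadata fields" % (
        entry.get_accession(), entry.get_type(), len(entry.metadata))


def fetch_and_summarize(accession, destdir="./"):
    """Download (or reuse) the SOFT file for `accession` and summarize it."""
    # Imported lazily: needs the GEOparse package and, usually, network access.
    import GEOparse
    entry = GEOparse.get_GEO(geo=accession, destdir=destdir, silent=True)
    return summarize_entry(entry)


class StubEntry:
    """Offline stand-in with the same methods, used only for this demo."""
    metadata = {"geo_accession": ["GSM1"], "title": ["demo sample"]}

    def get_accession(self):
        return "GSM1"

    def get_type(self):
        return "SAMPLE"


summary = summarize_entry(StubEntry())
```

With the package available, calling `fetch_and_summarize` with a real accession would return the same kind of summary for the downloaded entry.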

GEOparse.GEOparse.get_GEO_file(geo, destdir=None, annotate_gpl=False, how='full', include_data=False, silent=False, aspera=False)[source]¶

Download corresponding SOFT file given GEO accession.

Parameters:	geo (`str`) – GEO database identifier. destdir (`str`, optional) – Directory to download data. Defaults to None. annotate_gpl (`bool`, optional) – Download the GPL annotation instead of regular GPL. If not available, fallback to regular GPL file. Defaults to False. how (`str`, optional) – GSM download mode. Defaults to “full”. include_data (`bool`, optional) – Full download of GPLs including series and samples. Defaults to False. silent (`bool`, optional) – Do not print anything. Defaults to False. aspera (`bool`, optional) – EXPERIMENTAL Download using Aspera Connect. Follow Aspera instructions for further details. Defaults to False.
Returns:	Path to the downloaded file and the type of the GEO object.
Return type:	`2-tuple` of `str` and `str`

GEOparse.GEOparse.parse_GDS(filepath)[source]¶

Parse GDS SOFT file.

Parameters:	filepath (`str`) – Path to GDS SOFT file.
Returns:	A GDS object.
Return type:	`GEOparse.GDS`

GEOparse.GEOparse.parse_GDS_columns(lines, subsets)[source]¶

Parse list of lines with columns description from SOFT file of GDS.

Parameters:	lines (`Iterable`) – Iterator over the lines. subsets (`dict` of `GEOparse.GDSSubset`) – Subsets to use.
Returns:	Columns description.
Return type:	`pandas.DataFrame`

GEOparse.GEOparse.parse_GPL(filepath, entry_name=None, partial=None)[source]¶

Parse GPL entry from SOFT file.

Parameters:

filepath (`str` or `Iterable`) – Path to file with 1 GPL entry or list of lines representing GPL from GSE file.
entry_name (`str`, optional) – Name of the entry. By default it is inferred from the data.
partial (`iterable`, optional) – A list of accession IDs of GSMs to be partially extracted from GPL; works only if a file/accession is a GPL.
Returns:	A GPL object.
Return type:	`GEOparse.GPL`

GEOparse.GEOparse.parse_GSE(filepath)[source]¶

Parse GSE SOFT file.

Parameters:	filepath (`str`) – Path to GSE SOFT file.
Returns:	A GSE object.
Return type:	`GEOparse.GSE`

GEOparse.GEOparse.parse_GSM(filepath, entry_name=None)[source]¶

Parse GSM entry from SOFT file.

Parameters:	filepath (`str` or `Iterable`) – Path to file with 1 GSM entry or list of lines representing GSM from GSE file. entry_name (`str`, optional) – Name of the entry. By default it is inferred from the data.
Returns:	A GSM object.
Return type:	`GEOparse.GSM`

GEOparse.GEOparse.parse_columns(lines)[source]¶

Parse list of lines with columns description from SOFT file.

Parameters:	lines (`Iterable`) – Iterator over the lines.
Returns:	Columns description.
Return type:	`pandas.DataFrame`

GEOparse.GEOparse.parse_entry_name(nameline)[source]¶

Parse line that starts with ^ and assign the name to it.

Parameters:	nameline (`str`) – A line to process.
Returns:	Entry name.
Return type:	`str`
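In SOFT files an entry begins with a line such as `^SAMPLE = GSM12345`. The sketch below shows one way to pull the name out of such a line. It is an illustration only: `entry_name_from_line` is a hypothetical helper and is not claimed to match the library's implementation.

```python
def entry_name_from_line(nameline):
    """Return the text after ' = ' on a SOFT entry line starting with '^'."""
    if not nameline.startswith("^"):
        raise ValueError("Entry lines are expected to start with '^'")
    # partition splits on the first ' = ' only, so names containing '=' survive.
    _, _, name = nameline.strip().partition(" = ")
    return name


entry_name = entry_name_from_line("^SAMPLE = GSM12345\n")
```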

GEOparse.GEOparse.parse_metadata(lines)[source]¶

Parse list of lines with metadata information from SOFT file.

Parameters:	lines (`Iterable`) – Iterator over the lines.
Returns:	Metadata from SOFT file.
Return type:	`dict`
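Metadata lines in a SOFT file start with `!` and have the form `!key = value`, and a key may repeat. A simplified sketch of collecting them into a dictionary of lists is given below. `collect_metadata` is hypothetical, and the library may normalise keys differently, for example by stripping an entity prefix.

```python
from collections import defaultdict


def collect_metadata(lines):
    """Gather '!key = value' lines into {key: [values, ...]}."""
    metadata = defaultdict(list)
    for line in lines:
        line = line.rstrip()
        # Skip non-metadata lines and markers such as '!sample_table_begin'.
        if not line.startswith("!") or " = " not in line:
            continue
        key, _, value = line[1:].partition(" = ")
        metadata[key].append(value)
    return dict(metadata)


soft_lines = [
    "^SAMPLE = GSM1\n",
    "!Sample_title = Liver, replicate 1\n",
    "!Sample_characteristics_ch1 = tissue: liver\n",
    "!Sample_characteristics_ch1 = age: 8 weeks\n",
    "!sample_table_begin\n",
]
parsed = collect_metadata(soft_lines)
```

Storing every value in a list, even when a key occurs once, keeps the structure uniform and matches the list-valued metadata that get_metadata_attribute expects.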

GEOparse.GEOparse.parse_table_data(lines)[source]¶

Parse list of lines from SOFT file into DataFrame.

Parameters:	lines (`Iterable`) – Iterator over the lines.
Returns:	Table data.
Return type:	`pandas.DataFrame`
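The table part of a SOFT entry is tab-separated text with a header row. As a rough, pandas-free sketch of the parsing step, the hypothetical helper `read_table` below filters out lines that start with `^`, `!`, or `#` and reads the rest with the standard `csv` module. The real function returns a `pandas.DataFrame`, and all values here stay strings.

```python
import csv


def read_table(lines):
    """Parse tab-separated SOFT table lines into a list of row dictionaries."""
    data_lines = [
        line.rstrip("\n")
        for line in lines
        if line.strip() and line[0] not in "^!#"
    ]
    return list(csv.DictReader(data_lines, delimiter="\t"))


table_lines = [
    "!sample_table_begin\n",
    "ID_REF\tVALUE\n",
    "p1\t1.5\n",
    "p2\t2.5\n",
    "!sample_table_end\n",
]
rows = read_table(table_lines)
```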

GEOparse.logger module¶

GEOparse.logger.set_verbosity(level)[source]¶

Set the log level.

Parameters:	level (`str`) – Level name eg. DEBUG or ERROR

GEOparse.logger.add_log_file(path)[source]¶

Add log file.

Parameters:	path (`str`) – Path to the log file.

GEOparse.utils module¶

GEOparse.utils.download_from_url(url, destination_path, force=False, aspera=False, silent=False)[source]¶

Download file from remote server.

If the file is already downloaded and the force flag is on, the file will be removed.

Parameters:	url (`str`) – Path to the file on remote server (including file name) destination_path (`str`) – Path to the file on local machine (including file name) force (`bool`) – If file exist force to overwrite it. Defaults to False. aspera (`bool`) – Download with Aspera Connect. Defaults to False. silent (`bool`) – Do not print any message. Defaults to False.

GEOparse.utils.mkdir_p(path_to_dir)[source]¶

Make directory(ies).

This function behaves like mkdir -p.

Parameters:	path_to_dir (`str`) – Path to the directory to make.
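The `mkdir -p` behaviour (create missing parent directories, do not fail if the directory already exists) can be reproduced with the standard library alone. The helper name `make_dirs` is hypothetical and this is not the library's own code.

```python
import os
import tempfile


def make_dirs(path_to_dir):
    """Create the directory and any missing parents; do nothing if it exists."""
    os.makedirs(path_to_dir, exist_ok=True)


with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "a", "b", "c")
    make_dirs(nested)
    make_dirs(nested)  # second call is a no-op, as with mkdir -p
    created = os.path.isdir(nested)
```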

GEOparse.utils.smart_open(*args, **kwds)[source]¶

Open file intelligently depending on the source and python version.

Parameters:	filepath (`str`) – Path to the file.
Yields:	Context manager for file handle.
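One plausible reading of "open file intelligently" is choosing the opener from the file name, since SOFT files are often gzip-compressed. The sketch below shows that idea as a context manager. `open_text` is a hypothetical helper, and selecting by the `.gz` extension is an assumption, not a description of what smart_open does internally.

```python
import gzip
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def open_text(filepath):
    """Yield a text-mode handle, using gzip for paths ending in '.gz'."""
    opener = gzip.open if filepath.endswith(".gz") else open
    handle = opener(filepath, "rt", encoding="utf-8")
    try:
        yield handle
    finally:
        handle.close()


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "demo.soft.gz")
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("^SAMPLE = GSM1\n")
    with open_text(path) as handle:
        first_line = handle.readline().strip()
```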

GEOparse.utils.which(program)[source]¶

Check if executable exists.

The code is taken from: https://stackoverflow.com/questions/377017/test-if-executable-exists-in-python

Parameters:	program (`str`) – Path to the executable.

Returns:	Path to the program or None.
Return type:	`str` or `None`
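On Python 3 the same check is available in the standard library as `shutil.which`, which also returns the path to the program or None. The sketch below wraps it in a hypothetical `find_program` helper. A check like this is useful before calling external tools, for example the SRA-Toolkit executables needed for FASTQ conversion.

```python
import shutil


def find_program(program):
    """Return the full path to `program` if it is on PATH, else None."""
    return shutil.which(program)


# A name that is very unlikely to exist on any system.
missing = find_program("surely-not-an-installed-program-42")
```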
