PyPop.ParseFile#

Parsing input population data files.

Includes ParseGenotypeFile for parsing individuals genotyped at multiple loci and ParseAlleleCountFile for parsing literature data which only includes allele counts.

Both file formats are assumed to start with population header information: a line of column headers (population metadata) followed by a line with the actual data. This is followed by the column headers for the samples (sample metadata) and then the sample data itself (either individuals in the genotyped case, or alleles in the allele count case).
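As an illustration of this layout, the sketch below splits a small tab-separated text into its four parts using only the standard library. It does not call PyPop; the function name split_population_file and all field names and values in the sample text are hypothetical, chosen only to show the structure.

```python
def split_population_file(text, separator="\t"):
    """Split file text into population data, sample headers and sample rows.

    Assumes the four-part layout described above: population headers,
    one population data line, sample headers, then the sample rows.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    pop_headers = lines[0].split(separator)
    pop_values = lines[1].split(separator)
    sample_headers = lines[2].split(separator)
    sample_rows = [line.split(separator) for line in lines[3:]]
    return dict(zip(pop_headers, pop_values)), sample_headers, sample_rows


# Hypothetical file content; field names and values are made up.
example = (
    "labcode\tethnic\n"
    "XYZ\tExample\n"
    "id\ta_1\ta_2\n"
    "s001\t0101\t0201\n"
    "s002\t0301\t****\n"
)

pop_data, sample_headers, sample_rows = split_population_file(example)
print(pop_data)
print(sample_headers)
print(len(sample_rows))
```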

Classes#

ParseFile

Common functionality for reading the two file formats.

ParseGenotypeFile

Class to parse standard datafile in genotype form.

ParseAlleleCountFile

Class to parse datafile in allele count form.

Module Contents#

class ParseFile(filename, validPopFields=None, validSampleFields=None, separator='\t', fieldPairDesignator='_1:_2', alleleDesignator='*', popNameDesignator='+', debug=0)#

Common functionality for reading the two file formats.

Base class.

Parameters:
  • filename (str) – filename for the file to be parsed.

  • validPopFields (str) – valid headers (one per line) for overall population data (None in the signature; no usable default)

  • validSampleFields (str) – valid headers (one per line) for lines of sample data (None in the signature; no usable default)

  • separator (str, optional) – separator for adjacent fields (default: a tab stop, ‘\t’).

  • fieldPairDesignator (str, optional) – the additions to the allele “stem” for fields grouped in pairs (allele fields). For example, for HLA-A and HLA-A(2) use :(2); for DQA1_1 and DQA1_2 use _1:_2, in which case both fields are distinguished from the stem (default in the signature: _1:_2)

  • alleleDesignator (str, optional) – first character of the key which determines whether this column contains allele data. Defaults to *

  • popNameDesignator (str, optional) – first character of the key which determines whether this column contains the population name. Defaults to +

  • debug (int, optional) – Switches debugging on if set to 1 (default: no debugging, 0)

getPopData()#

Returns a dictionary of population data.

Returns:

keyed by types specified in population metadata file

Return type:

dict

getSampleMap()#

Returns dictionary of sample data.

Returns:

each entry contains either a 2-tuple of column positions or a single column position, keyed by the field originally specified in the sample metadata file

Return type:

dict
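To make the shape of such a mapping concrete, here is a minimal sketch that builds a comparable dictionary from a list of column headers. It is not PyPop's implementation: the function name build_sample_map, the choice of the stem as the key for paired fields, and the header names are all assumptions for illustration.

```python
def build_sample_map(headers, pair_designator="_1:_2"):
    """Map field names to column positions.

    Paired fields (same stem plus the two suffixes in pair_designator)
    map to a 2-tuple of positions; all other fields map to a single int.
    """
    first_suffix, second_suffix = pair_designator.split(":")
    positions = {name: index for index, name in enumerate(headers)}
    sample_map = {}
    paired = set()
    for name, index in positions.items():
        if not name.endswith(first_suffix):
            continue
        # Slice by explicit length so an empty first suffix also works.
        stem = name[: len(name) - len(first_suffix)]
        partner = stem + second_suffix
        if partner in positions and partner != name:
            sample_map[stem] = (index, positions[partner])
            paired.update({name, partner})
    for name, index in positions.items():
        if name not in paired:
            sample_map[name] = index
    return sample_map


# Hypothetical headers using the "_1:_2" convention.
underscore_map = build_sample_map(["populat", "id", "a_1", "a_2"])
# Hypothetical headers using the ":(2)" convention (empty first suffix).
paren_map = build_sample_map(["id", "HLA-A", "HLA-A(2)"], ":(2)")
print(underscore_map)
print(paren_map)
```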

getFileData()#

Returns the file data.

Returns:

a 2-tuple “wrapper”:

  • str: raw sample lines, without header metadata.

  • str: the field separator.

Return type:

tuple

genSampleOutput(fieldList)#

Prints the data specified in ordered field list.

Deprecated since version 0.7.0.

serializeMetadataTo(stream)#

Write metadata to stream.

Parameters:

stream (XMLOutputStream) – output stream

class ParseGenotypeFile(filename, untypedAllele='****', **kw)#

Bases: ParseFile

Class to parse standard datafile in genotype form.

Processes files that consist specifically of data with individuals genotyped for one or more loci.

Parameters:
  • filename (str) – filename for the file to be parsed.

  • untypedAllele (str, optional) – The designator for an untyped locus. Defaults to ****.

genValidKey(field, fieldList)#

Check and validate key.

  • field: string with field name.

  • fieldList: a dictionary of valid fields.

Checks whether field is a valid key and generates the appropriate key. Returns a 2-tuple consisting of the isValidKey boolean and the key.

Note: this is done explicitly in the subclass of the abstract ParseFile class, since the subclass should have knowledge about the nature of the fields, but the abstract class should not.

getMatrix()#

Returns the genotype data.

Returns the genotype data in a ‘StringMatrix’ NumPy array.

serializeSubclassMetadataTo(stream)#

Serialize subclass-specific metadata.

class ParseAlleleCountFile(filename, **kw)#

Bases: ParseFile

Class to parse datafile in allele count form.

Input files consist of allele counts across a whole population. Currently only handles one locus per population. Example:

<metadata-line1>
<metadata-line2>
DQA1 count
0102 20
0103 33
...

genValidKey(field, fieldList)#

Checks validity of a field.

Parameters:
  • field (str) – field to check

  • fieldList (str) – list that field is checked against

Returns:

2-tuple of:

  • boolean: whether key is valid

  • str: key

Return type:

tuple

Note

The first element in fieldList is a locus name, which may contain several loci (delimited by colons, :). If a field in the input file matches any of these keys, this method returns the field and a valid match.

Example

If the first element of fieldList is DQA1:DRA:DQB1, then calling this method with field set to DRA returns (True, DRA).
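The matching rule described in the note can be sketched as a standalone function. This is not the PyPop method itself: the name gen_valid_key is hypothetical, only the first element of the list is examined, and returning None as the key for a non-matching field is an assumption, since the behaviour in that case is not documented here.

```python
def gen_valid_key(field, field_list):
    """Check field against the colon-delimited loci in field_list[0].

    Returns a 2-tuple (is_valid, key). The key for a non-match is None,
    which is an assumption made for this sketch.
    """
    loci = field_list[0].split(":")
    if field in loci:
        return True, field
    return False, None


# The second list element is a hypothetical extra field name.
match = gen_valid_key("DRA", ["DQA1:DRA:DQB1", "count"])
no_match = gen_valid_key("B", ["DQA1:DRA:DQB1", "count"])
print(match)
print(no_match)
```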

serializeSubclassMetadataTo(stream)#

Serialize subclass-specific metadata.

Parameters:

stream (XMLOutputStream) – output stream

getAlleleTable()#

Get the current allele table.

Returns:

keyed by allele name with value count

Return type:

dict
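For orientation, a dictionary of this shape could be built from allele count lines as in the sketch below. It does not use PyPop: the function name build_allele_table is hypothetical, and storing counts as integers and summing repeated alleles are assumptions of this sketch rather than documented behaviour.

```python
def build_allele_table(sample_lines, separator="\t"):
    """Build a dict keyed by allele name with the count as value.

    Blank lines are skipped; counts for a repeated allele are summed
    (an assumption made for this sketch).
    """
    table = {}
    for line in sample_lines:
        if not line.strip():
            continue
        allele, count = line.split(separator)
        table[allele] = table.get(allele, 0) + int(count)
    return table


# Sample rows taken from the allele count example, tab-separated.
allele_table = build_allele_table(["0102\t20", "0103\t33"])
print(allele_table)
```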

getLocusName()#

Get the locus name.

Returns:

locus name

Return type:

str

getMatrix()#

Get the full genotype data.

Returns:

containing all the genotype data

Return type:

StringMatrix