PyPop.ParseFile#
Parsing input population data files.
Includes ParseGenotypeFile for parsing individuals genotyped at multiple loci and ParseAlleleCountFile for parsing literature data which only includes allele counts.
Both file formats are assumed to begin with population header information, consisting of a line of column headers (population metadata) followed by a line with the actual data. This is followed by the column headers for the samples (sample metadata) and then the sample data itself (either individuals in the genotype case, or alleles in the allele count case).
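The layout described above can be sketched in a few lines of Python. This is only an illustration of the four-part structure, not PyPop's implementation: the helper name split_sections, the column names and the sample values are all invented for the example, and it assumes tab-separated fields with exactly one population data line.

```python
def split_sections(text, separator="\t"):
    """Split a population file into its four parts (hypothetical helper)."""
    # Ignore blank lines so a trailing newline does not create an empty row.
    lines = [line for line in text.splitlines() if line.strip()]
    pop_header = lines[0].split(separator)     # population metadata headers
    pop_values = lines[1].split(separator)     # population data line
    sample_header = lines[2].split(separator)  # sample metadata headers
    sample_rows = [line.split(separator) for line in lines[3:]]  # sample data
    pop_data = dict(zip(pop_header, pop_values))
    return pop_data, sample_header, sample_rows


# Invented example content; real files define their own fields.
example = (
    "populat\tregion\n"
    "ExamplePop\tExampleRegion\n"
    "id\tDQA1_1\tDQA1_2\n"
    "S001\t0102\t0103\n"
    "S002\t0103\t0103\n"
)
pop_data, sample_header, sample_rows = split_sections(example)
```

Here pop_data holds the population metadata as a dictionary, while sample_header and sample_rows hold the sample metadata and the sample data.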
Classes#
ParseFile | Common functionality for reading the two file formats.
ParseGenotypeFile | Class to parse standard datafile in genotype form.
ParseAlleleCountFile | Class to parse datafile in allele count form.
Module Contents#
- class ParseFile(filename, validPopFields=None, validSampleFields=None, separator='\t', fieldPairDesignator='_1:_2', alleleDesignator='*', popNameDesignator='+', debug=0)#
Common functionality for reading the two file formats.
Base class.
- Parameters:
filename (str) – name of the file to be parsed.
validPopFields (str) – valid headers (one per line) for overall population data (no default)
validSampleFields (str) – valid headers (one per line) for lines of sample data (no default)
separator (str, optional) – separator for adjacent fields (default: a tab, '\t')
fieldPairDesignator (str, optional) – consists of additions to the allele 'stem' for fields grouped in pairs (allele fields). For example, for HLA-A and HLA-A(2), use ':(2)'; for DQA1_1 and DQA1_2, use '_1:_2'. The latter case distinguishes both fields from the stem (default: '_1:_2')
alleleDesignator (str, optional) – first character of the key, which determines whether this column contains allele data (default: '*')
popNameDesignator (str, optional) – first character of the key, which determines whether this column contains the population name (default: '+')
debug (int, optional) – switches debugging on if set to 1 (default: 0, no debugging)
- getPopData()#
Returns a dictionary of population data.
- Returns:
population data, keyed by the types specified in the population metadata file
- Return type:
dict
- getSampleMap()#
Returns dictionary of sample data.
- Returns:
each entry contains either a 2-tuple of column positions or a single column position, keyed by the field originally specified in the sample metadata file
- Return type:
dict
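The shape of such a sample map, and the role of fieldPairDesignator in it, can be illustrated with a small stand-alone sketch. It is not PyPop's code: build_sample_map, its arguments and the choice of the stem as dictionary key are assumptions made for the example, as is the idea that the designator is split on its colon into two suffixes.

```python
def build_sample_map(columns, stems, pair_designator="_1:_2"):
    """Map fields to one column position, or a 2-tuple for paired allele fields.

    Hypothetical helper: keys and matching rules are assumptions.
    """
    # '_1:_2' -> ('_1', '_2'); ':(2)' -> ('', '(2)')
    first_suffix, second_suffix = pair_designator.split(":")
    position = {name: index for index, name in enumerate(columns)}
    sample_map = {}
    # Paired allele fields: both columns must be present in the header.
    for stem in stems:
        first, second = stem + first_suffix, stem + second_suffix
        if first in position and second in position:
            sample_map[stem] = (position[first], position[second])
    # Every column not consumed by a pair maps to a single position.
    paired = {index for pair in sample_map.values() for index in pair}
    for name, index in position.items():
        if index not in paired:
            sample_map[name] = index
    return sample_map


suffix_style = build_sample_map(["id", "DQA1_1", "DQA1_2"], ["DQA1"])
stem_style = build_sample_map(["id", "HLA-A", "HLA-A(2)"], ["HLA-A"], ":(2)")
```

In both calls the locus maps to the 2-tuple (1, 2) and the id column maps to the single position 0.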
- getFileData()#
Returns the file data.
- Returns:
a 2-tuple “wrapper”:
str: raw sample lines, without header metadata.
str: the field separator.
- Return type:
tuple
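A consumer of this 2-tuple would typically unpack it and split each raw line on the separator. The sketch below shows that pattern under stated assumptions: rows_from_file_data is a hypothetical helper, the tuple is built by hand rather than obtained from getFileData(), and because the exact container used for the raw lines is not specified here, the helper accepts either a single string or a list of lines.

```python
def rows_from_file_data(file_data):
    """Turn a (raw sample lines, separator) pair into lists of fields."""
    sample_lines, separator = file_data
    # Accept one multi-line string as well as a list of lines.
    if isinstance(sample_lines, str):
        sample_lines = sample_lines.splitlines()
    return [line.rstrip("\n").split(separator) for line in sample_lines]


# Hand-built stand-ins for the wrapper; values are invented.
as_list = (["S001\t0102\t0103\n", "S002\t0103\t0103\n"], "\t")
as_string = ("S001\t0102\t0103\nS002\t0103\t0103\n", "\t")
rows = rows_from_file_data(as_list)
rows_again = rows_from_file_data(as_string)
```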
- genSampleOutput(fieldList)#
Prints the data specified in the ordered field list.
Deprecated since version 0.7.0.
- serializeMetadataTo(stream)#
Write metadata to stream.
- Parameters:
stream (XMLOutputStream) – output stream
- class ParseGenotypeFile(filename, untypedAllele='****', **kw)#
Bases:
ParseFile
Class to parse standard datafile in genotype form.
Processes files that consist specifically of data with individuals genotyped for one or more loci.
- Parameters:
untypedAllele (str, optional) – designator for an untyped allele (default: '****')
The remaining parameters are those of the ParseFile base class:
filename (str) – name of the file to be parsed.
validPopFields (str) – valid headers (one per line) for overall population data (no default)
validSampleFields (str) – valid headers (one per line) for lines of sample data (no default)
separator (str, optional) – separator for adjacent fields (default: a tab, '\t')
fieldPairDesignator (str, optional) – consists of additions to the allele 'stem' for fields grouped in pairs (allele fields). For example, for HLA-A and HLA-A(2), use ':(2)'; for DQA1_1 and DQA1_2, use '_1:_2'. The latter case distinguishes both fields from the stem (default: '_1:_2')
alleleDesignator (str, optional) – first character of the key, which determines whether this column contains allele data (default: '*')
popNameDesignator (str, optional) – first character of the key, which determines whether this column contains the population name (default: '+')
debug (int, optional) – switches debugging on if set to 1 (default: 0, no debugging)
- genValidKey(field, fieldList)#
Check and validate key.
- Parameters:
field (str) – field name.
fieldList (dict) – valid fields.
Checks whether field is a valid key and generates the appropriate key.
- Returns:
a 2-tuple consisting of the isValidKey boolean and the key.
Note
This is done explicitly in the subclass of the abstract ParseFile class, since the subclass should have knowledge about the nature of the fields, but the abstract class should not.
- getMatrix()#
Returns the genotype data in a StringMatrix NumPy array.
- serializeSubclassMetadataTo(stream)#
Serialize subclass-specific metadata.
- class ParseAlleleCountFile(filename, **kw)#
Bases:
ParseFile
Class to parse datafile in allele count form.
Input files consist of allele counts across a whole population. Currently only handles one locus per population. Example:
<metadata-line1>
<metadata-line2>
DQA1 count
0102 20
0103 33
...
Accepts the parameters of the ParseFile base class.
- Parameters:
filename (str) – name of the file to be parsed.
validPopFields (str) – valid headers (one per line) for overall population data (no default)
validSampleFields (str) – valid headers (one per line) for lines of sample data (no default)
separator (str, optional) – separator for adjacent fields (default: a tab, '\t')
fieldPairDesignator (str, optional) – consists of additions to the allele 'stem' for fields grouped in pairs (allele fields). For example, for HLA-A and HLA-A(2), use ':(2)'; for DQA1_1 and DQA1_2, use '_1:_2'. The latter case distinguishes both fields from the stem (default: '_1:_2')
alleleDesignator (str, optional) – first character of the key, which determines whether this column contains allele data (default: '*')
popNameDesignator (str, optional) – first character of the key, which determines whether this column contains the population name (default: '+')
debug (int, optional) – switches debugging on if set to 1 (default: 0, no debugging)
- genValidKey(field, fieldList)#
Checks validity of a field.
- Parameters:
field (str) – field name to check
fieldList – valid fields; see the note below
- Returns:
2-tuple of:
boolean: whether the key is valid
str: the key
- Return type:
tuple
Note
The first element in fieldList is a locus name, which may contain many loci (delimited by colons, ':'). If field in the input file matches any of these keys, this method returns the field and a valid match.
Example
If the first element of fieldList is DQA1:DRA:DQB1, then calling this function with field set to DRA returns (True, DRA).
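The locus-matching rule from the note can be written out as a short stand-alone function. This is a sketch of that one rule only, not the method itself: matches_locus is a hypothetical name, the second list element is invented, and the return value for a field that does not match is an assumption, since this page does not say what the real method returns in that case.

```python
def matches_locus(field, field_list):
    """Check field against the colon-delimited loci in field_list[0]."""
    loci = field_list[0].split(":")  # 'DQA1:DRA:DQB1' -> ['DQA1', 'DRA', 'DQB1']
    if field in loci:
        return True, field
    # Assumption: the behaviour for non-matching fields is not documented here.
    return False, None


valid_fields = ["DQA1:DRA:DQB1", "count"]
hit = matches_locus("DRA", valid_fields)
miss = matches_locus("DPB1", valid_fields)
```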
- serializeSubclassMetadataTo(stream)#
Serialize subclass-specific metadata.
- Parameters:
stream (XMLOutputStream) – output stream
- getAlleleTable()#
Get the current allele table.
- Returns:
allele table, keyed by allele name with the count as value
- Return type:
dict
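A table of this shape can be built from the allele and count columns of the example input shown for this class. The sketch below is illustrative only: allele_table_from_rows is a hypothetical helper, and both the conversion of counts to int and the summing of repeated alleles are assumptions rather than documented behaviour.

```python
def allele_table_from_rows(rows):
    """Build a dictionary of allele name -> count from (allele, count) rows."""
    table = {}
    for allele, count in rows:
        # Assumption: counts are whole numbers, repeated alleles are summed.
        table[allele] = table.get(allele, 0) + int(count)
    return table


# Rows taken from the example input above, already split into fields.
rows = [("0102", "20"), ("0103", "33")]
table = allele_table_from_rows(rows)
```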
- getMatrix()#
Get the full genotype data.
- Returns:
the matrix containing all the genotype data
- Return type:
StringMatrix