8.12. Tabular data#
The cogent3
Table
data type provides methods for manipulating tabular data. Here we summarise the basic capabilities. One important point to make is that tables are immutable, in that once a column has been created you cannot modify individual elements. But you can modify what columns a table has, their name and order. Table columns themselves are just numpy arrays. (See the cogent3 cookbook for a more thorough description.)
8.12.1. Making a Table from standard Python objects#
8.12.1.1. From a column-oriented dict#
We convert dinucleotide data into a column based dict
– the keys will become the column names and the values will be the column.
data = {"base1": base1, "base2": base2}
We make a cogent3
table using a utility function.
from cogent3 import make_table
table = make_table(data=data)
table
base1 | base2 |
---|---|
N | N |
T | G |
C | C |
A | G |
T | T |
... | ... |
C | T |
G | A |
T | G |
A | C |
C | T |
Top 5 and bottom 5 rows from 1,297 rows x 2 columns
8.12.1.2. From a header and a list of lists#
I’m going to make a list of lists from dinucs
by just convert each dinucleotide into a list.
rows = [list(dinuc) for dinuc in dinucs]
rows[:4]
[['N', 'N'], ['T', 'G'], ['C', 'C'], ['A', 'G']]
In this instance, I need to specify the column names using the argument header
.
table = make_table(header=["base1", "base2"], data=rows)
table
base1 | base2 |
---|---|
N | N |
T | G |
C | C |
A | G |
T | T |
... | ... |
C | T |
G | A |
T | G |
A | C |
C | T |
Top 5 and bottom 5 rows from 1,297 rows x 2 columns
8.12.2. Loading a table from a file#
We load a tab separated data file using the load_table()
function. The format is inferred from the filename suffix.
from cogent3 import load_table
stats = load_table("data/edge_stats.tsv")
stats
edge | kappa |
---|---|
FlyingFox | 8.5680 |
DogFaced | 7.6584 |
edge.0 | 4.6557 |
FreeTaile | 4.1230 |
LittleBro | 6.0653 |
edge.1 | 4.5488 |
TombBat | 6.7665 |
RoundEare | 5.8395 |
edge.2 | 5.2402 |
edge.3 | 8.2848 |
10 rows x 2 columns
8.12.3. Getting summary using Table.head()
or Table.tail()
#
These display the top or bottom of a table.
stats.head()
edge | kappa |
---|---|
FlyingFox | 8.5680 |
DogFaced | 7.6584 |
edge.0 | 4.6557 |
FreeTaile | 4.1230 |
LittleBro | 6.0653 |
Top 5 rows from 10 rows x 2 columns
stats.tail()
edge | kappa |
---|---|
edge.1 | 4.5488 |
TombBat | 6.7665 |
RoundEare | 5.8395 |
edge.2 | 5.2402 |
edge.3 | 8.2848 |
Bottom 5 rows from 10 rows x 2 columns
8.12.4. Slicing a Table#
Tables are “row oriented”, so the first index concerns rows, the column.
stats[:4]
edge | kappa |
---|---|
FlyingFox | 8.5680 |
DogFaced | 7.6584 |
edge.0 | 4.6557 |
FreeTaile | 4.1230 |
4 rows x 2 columns
8.12.5. Getting a column#
Tables have a column
attribute.
stats.columns
Columns('edge': <U9, 'kappa': float64)
This has dict
like properties and supports you getting a column using the column header,
stats.columns["kappa"]
array([8.5679832 , 7.65839569, 4.6557154 , 4.12300489, 6.06528905,
4.54883285, 6.76650876, 5.83950315, 5.24024651, 8.28483208])
or, using an int
like it’s a series – in this case indexes are defined by the column.order
attribute.
stats.columns.order
('edge', 'kappa')
stats.columns[0]
array(['FlyingFox', 'DogFaced', 'edge.0', 'FreeTaile', 'LittleBro',
'edge.1', 'TombBat', 'RoundEare', 'edge.2', 'edge.3'], dtype='<U9')
8.12.6. Creating a new column#
You write a function that takes the rows from the columns ytou want and returns the result of some operation. I’ll just take the square root of kappa.
from math import sqrt
k_rt = stats.with_new_column("sqrt(kappa)", lambda x: sqrt(x), columns=["kappa"])
k_rt.head()
edge | kappa | sqrt(kappa) |
---|---|---|
FlyingFox | 8.5680 | 2.9271 |
DogFaced | 7.6584 | 2.7674 |
edge.0 | 4.6557 | 2.1577 |
FreeTaile | 4.1230 | 2.0305 |
LittleBro | 6.0653 | 2.4628 |
Top 5 rows from 10 rows x 3 columns
8.12.7. Filtering a table to include rows by value#
In our dinucleotide table, we only want rows where both bases are in the canonical set {A, C, G, T}
. We do this via a filter step using a lambda
function and a set
object consisting of these bases. The filtered()
method calls the lambda
with each row. Only if the lambda
returns True
will the row be included in the new Table
. In our case, our lambda
will return true if the set of elements in the row is a subset of all the basses.
table = table.filtered(lambda x: set(x) <= {"A", "C", "G", "T"})
table
base1 | base2 |
---|---|
T | G |
C | C |
A | G |
T | T |
C | A |
... | ... |
C | T |
G | A |
T | G |
A | C |
C | T |
Top 5 and bottom 5 rows from 1,296 rows x 2 columns
Note
I did not specify which columns because the default is to use all columns.
8.12.8. Counting unique values#
This method returns counts of the unique combinations of values from the specified columns. The result is a cogent3
type, a CategoryCounter
instance, which has some useful properties. Principal being that it behaves like a dict
.
unique = table.count_unique()
unique
CategoryCounter({('T', 'G'): 87, ('C', 'C'): 81, ('A', 'G'): 142, ('T', 'T'): 58, ('C', 'A'): 110, ('A', 'C'): 66, ('T', 'A'): 59, ('A', 'A'): 158, ('G', 'C'): 65, ('G', 'A'): 113, ('A', 'T'): 81, ('G', 'T'): 63, ('G', 'G'): 69, ('C', 'T'): 72, ('T', 'C'): 57, ('C', 'G'): 15})
8.12.8.1. To categorical count#
Another being that it can produce CategoryCount
object
cat_counts = unique.to_categorical()
cat_counts
A | C | G | T | |
---|---|---|---|---|
A | 158 | 66 | 142 | 81 |
C | 110 | 81 | 15 | 72 |
G | 113 | 65 | 69 | 63 |
T | 59 | 57 | 87 | 58 |
A | C | G | T | |
---|---|---|---|---|
A | 151.7593 | 92.7801 | 107.9560 | 94.5046 |
C | 94.3827 | 57.7022 | 67.1404 | 58.7747 |
G | 105.2469 | 64.3441 | 74.8688 | 65.5401 |
T | 88.6111 | 54.1736 | 63.0347 | 55.1806 |
A | C | G | T | |
---|---|---|---|---|
A | 0.5066 | -2.7803 | 3.2766 | -1.3892 |
C | 1.6075 | 3.0670 | -6.3633 | 1.7251 |
G | 0.7557 | 0.0818 | -0.6783 | -0.3138 |
T | -3.1457 | 0.3840 | 3.0185 | 0.3796 |
which supports statistical testing of categorical data. For instance
cat_counts.chisq_test()
chisq | df | pvalue |
---|---|---|
96.545 | 9 | 7.85e-17 |
A | C | G | T | |
---|---|---|---|---|
A | 158 | 66 | 142 | 81 |
C | 110 | 81 | 15 | 72 |
G | 113 | 65 | 69 | 63 |
T | 59 | 57 | 87 | 58 |
A | C | G | T | |
---|---|---|---|---|
A | 151.7593 | 92.7801 | 107.9560 | 94.5046 |
C | 94.3827 | 57.7022 | 67.1404 | 58.7747 |
G | 105.2469 | 64.3441 | 74.8688 | 65.5401 |
T | 88.6111 | 54.1736 | 63.0347 | 55.1806 |
A | C | G | T | |
---|---|---|---|---|
A | 0.5066 | -2.7803 | 3.2766 | -1.3892 |
C | 1.6075 | 3.0670 | -6.3633 | 1.7251 |
G | 0.7557 | 0.0818 | -0.6783 | -0.3138 |
T | -3.1457 | 0.3840 | 3.0185 | 0.3796 |
8.12.9. Generating categorical counts from a Table#
We can also get a CategoryCount
object via Table.to_categorical()
. In this case, the counts must be fully specified prior to constructing the table. (Meaning you’ve already done the counting part.) In addition, we also need to specify a column whose values are the row categories. The latter is achieved setting make_table(index_name=<column name>)
. In this case, I specify the column name of the index is an empty string.
data = {
"A": (158, 110, 113, 59),
"C": (66, 81, 65, 57),
"G": (142, 15, 69, 87),
"T": (81, 72, 63, 58),
"": ["A", "C", "G", "T"],
}
table = make_table(data=data, index_name="")
table
A | C | G | T | |
---|---|---|---|---|
A | 158 | 66 | 142 | 81 |
C | 110 | 81 | 15 | 72 |
G | 113 | 65 | 69 | 63 |
T | 59 | 57 | 87 | 58 |
4 rows x 5 columns
cat_counts = table.to_categorical()
cat_counts
A | C | G | T | |
---|---|---|---|---|
A | 158 | 66 | 142 | 81 |
C | 110 | 81 | 15 | 72 |
G | 113 | 65 | 69 | 63 |
T | 59 | 57 | 87 | 58 |
A | C | G | T | |
---|---|---|---|---|
A | 151.7593 | 92.7801 | 107.9560 | 94.5046 |
C | 94.3827 | 57.7022 | 67.1404 | 58.7747 |
G | 105.2469 | 64.3441 | 74.8688 | 65.5401 |
T | 88.6111 | 54.1736 | 63.0347 | 55.1806 |
A | C | G | T | |
---|---|---|---|---|
A | 0.5066 | -2.7803 | 3.2766 | -1.3892 |
C | 1.6075 | 3.0670 | -6.3633 | 1.7251 |
G | 0.7557 | 0.0818 | -0.6783 | -0.3138 |
T | -3.1457 | 0.3840 | 3.0185 | 0.3796 |