Paolo Cozzi


Information



E-mail
cozzi@ibba.cnr.it

Phone
+39 0223499477

Office
Milano

Research area
BIOGEN


ORCID: 0000-0003-0388-6874
ResearchGate: Paolo Cozzi
LinkedIn: Paolo Cozzi

Cozzi Paolo

Researcher

Education

2007-2010: PhD in Molecular Medicine, University of Milano

2004-2007: Specialist degree in Bioinformatics, University of Milano-Bicocca

2000-2004: Degree in Biotechnologies, University of Milano-Bicocca

Professional experience

2022-Present: Researcher at IBBA-CNR

2019-2021: Graduated Research Fellow at IBBA-CNR

2018-2019: Graduated Research Fellow at ITB-CNR

2010-2018: Bioinformatician at Parco Tecnologico Padano

2007-2010: Bioinformatician at ITB-CNR

Research interests

  • Bioinformatics
  • Genomics
  • Metagenomics
  • Genetic Annotation
  • HPC
  • Cloud Computing
  • Programming Languages
  • Databases
  • Algorithms
  • AI
  • Data Visualization

Projects in progress

Sheep-TreeSeq: Scalable analysis of sheep genetic diversity using trees (graphs) of genomic sequences

Project duration: 01/12/2023 - 30/11/2025
Financing body: CNR/Royal Society (2024-2025 biennium)
Project research leader: Filippo Biscarini
Headquarters: Milano

Sheep-TreeSeq will perform scalable analysis of genomic diversity of sheep global populations using the novel tree sequence data format and methodology.

Technological advances in agritech have increased the availability of genomic data, producing massive datasets ("big data") that are challenging to store, process and analyse. The main difficulties are the sheer volume of the data, the rapid generation of new data (updating results, expanding training populations, streaming applications), and the heterogeneity of data sources (integration of data from multiple sequencing and genotyping platforms). The tree sequence approach addresses these challenges by providing lossless compression and a novel representation of the data. For example, applying tree sequences to the 1000 Bull Genomes Project data gave a lossless compression of about 90%, reducing the data size from ~800 GB to 45 GB. For the Sheep-TreeSeq project we will use around 3,500 sheep whole-genome sequences and over 50,000 genotypes (~10 TB of data).

Our plan is to apply the tree sequence approach to compress the data and obtain a representation well suited to population genetics and demographic analysis: (i) principal component and genealogical nearest neighbour clustering; (ii) fixation index measuring genetic differentiation; (iii) deep neural network based clustering methods; (iv) detection of runs of homozygosity (ROH) and heterozygosity-rich regions (HRR).
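To make item (iv) concrete, the sketch below shows one simple way a run of homozygosity can be detected from per-marker genotypes. It is a toy illustration only, not the project's pipeline, and it does not use tree sequences. The function name `find_roh`, the 0/1/2 genotype coding (copies of the alternate allele), and the thresholds `min_snps` and `min_length_bp` are all assumptions made for the example.

```python
from typing import List, Tuple


def find_roh(
    positions: List[int],
    genotypes: List[int],
    min_snps: int = 3,
    min_length_bp: int = 1000,
) -> List[Tuple[int, int, int]]:
    """Return (start_bp, end_bp, n_snps) for each run of consecutive
    homozygous markers that passes both thresholds.

    Assumed coding: 0 and 2 are homozygous, 1 is heterozygous.
    Missing genotypes are not handled in this sketch.
    """
    runs = []     # (first_index, last_index) of each homozygous stretch
    start = None  # index of the first marker in the current stretch
    for i, g in enumerate(genotypes):
        homozygous = g in (0, 2)
        if homozygous and start is None:
            start = i
        elif not homozygous and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:  # close a stretch that reaches the last marker
        runs.append((start, len(genotypes) - 1))

    result = []
    for first, last in runs:
        n_snps = last - first + 1
        length_bp = positions[last] - positions[first]
        if n_snps >= min_snps and length_bp >= min_length_bp:
            result.append((positions[first], positions[last], n_snps))
    return result


# Hypothetical markers for one animal on one chromosome.
positions = [100, 500, 900, 1500, 2100, 2600, 3000, 3400]
genotypes = [0, 2, 1, 0, 0, 2, 2, 1]
roh = find_roh(positions, genotypes, min_snps=3, min_length_bp=1000)
print(roh)
```

With these inputs the first homozygous stretch (two markers) is discarded by the `min_snps` threshold and only the four-marker stretch is reported. Real ROH callers typically also tolerate a few heterozygous or missing calls inside a run, which this sketch deliberately omits.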

This is the first time that this approach has been applied to sheep genomics.

Projects completed
