Recommendation

An open-source pipeline to reconstruct phylogenies with paleoproteomic data

Leslea Hlusko based on reviews by Katerina Douka and 2 anonymous reviewers

A recommendation of:

PaleoProPhyler: a reproducible pipeline for phylogenetic inference using ancient proteins

Ioannis Patramanis, Jazmín Ramos-Madrigal, Enrico Cappellini, Fernando Racimo (2023), bioRxiv, ver.3, peer-reviewed and recommended by PCI Paleontology https://doi.org/10.1101/2022.12.12.519721

Now published in Peer Community Journal.

Abstract

Ancient proteins from fossilized or semi-fossilized remains can yield phylogenetic information at broad temporal horizons, in some cases even millions of years into the past. In recent years, peptides extracted from archaic hominins and long-extinct mega-fauna have enabled unprecedented insights into their evolutionary history. In contrast to the field of ancient DNA - where several computational methods exist to process and analyze sequencing data - few tools exist for handling ancient protein sequence data. Instead, most studies rely on loosely combined custom scripts, which makes it difficult to reproduce results or share methodologies across research groups. Here, we present PaleoProPhyler: a new fully reproducible pipeline for aligning ancient peptide data and subsequently performing phylogenetic analyses. The pipeline can not only process various forms of proteomic data, but also easily harness genetic data in different formats (CRAM, BAM, VCF) and translate it, allowing the user to create reference panels for phyloproteomic analyses. We describe the various steps of the pipeline and its many functionalities, and provide some examples of how to use it. PaleoProPhyler allows researchers with little bioinformatics experience to efficiently analyze palaeoproteomic sequences, so as to derive insights from this valuable source of evolutionary data.

Pipeline, Workflow, Palaeoproteomics, Paleoproteomics, Phylogenetics, Phyloproteomics, Hominid evolution, Human evolution

Submission: posted 24 February 2023, validated 24 February 2023
Recommendation: posted 01 September 2023, validated 19 September 2023

Cite this recommendation as:
Hlusko, L. (2023) An open-source pipeline to reconstruct phylogenies with paleoproteomic data. Peer Community in Paleontology, 100220. https://doi.org/10.24072/pci.paleo.100220

Recommendation

One of the most recent technological advances in paleontology enables the characterization of ancient proteins, a new discipline known as palaeoproteomics (Ostrom et al., 2000; Warinner et al., 2022). Palaeoproteomics has superficial similarities with ancient DNA, as both work with ancient molecules; however, the former focuses on peptides and the latter on nucleotides. While the study of ancient DNA is more established (e.g., Shapiro et al., 2019), palaeoproteomics is experiencing a rapid diversification of applications, from deep-time paleontology (e.g., Schroeter et al., 2022) to taxonomic identification of bone fragments (e.g., Douka et al., 2019) and determination of the genetic sex of ancient individuals (e.g., Lugli et al., 2022). However, as Patramanis et al. (2023) note in this manuscript, tools for analyzing protein sequence data are still at an informal stage, making the application of this methodology a challenge for many newcomers to the discipline, especially those with little bioinformatics expertise.

In the spirit of democratizing the field of palaeoproteomics, Patramanis et al. (2023) developed an open-source pipeline, PaleoProPhyler, released under a CC-BY license (https://github.com/johnpatramanis/Proteomic_Pipeline). Here, Patramanis et al. (2023) introduce their workflow, designed to facilitate the phylogenetic analysis of ancient proteins. This pipeline builds on the methods of earlier studies probing the phylogenetic relationships of the extinct rhinoceros genus Stephanorhinus (Cappellini et al., 2019), the large extinct ape Gigantopithecus (Welker et al., 2019), and Homo antecessor (Welker et al., 2020). PaleoProPhyler has three interacting modules that initialize, construct, and analyze an input dataset. The authors provide a demonstration of application, presenting a molecular hominid phyloproteomic tree.

In order to run some of the analyses within the pipeline, the authors also generated the Hominid Palaeoproteomic Reference Dataset, which includes 10,058 protein sequences per individual translated from publicly available whole genomes of extant hominids (orangutans, gorillas, chimpanzees, and humans) as well as some ancient genomes of Neanderthals and Denisovans. This valuable research resource is also publicly available on Zenodo (Patramanis et al., 2022).
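To illustrate what "translated from whole genomes" means in practice, the following is a minimal sketch of in silico translation of a coding DNA sequence into amino acids using the standard genetic code. It is not PaleoProPhyler's implementation: the function name `translate` and the `unknown` placeholder are assumptions made for this example, and a real workflow must also handle gene annotation, strand, introns, and missing genotypes.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str, unknown: str = "X") -> str:
    """Translate an in-frame coding DNA sequence into amino acids.

    Codons that contain ambiguous bases (e.g. N) are written as `unknown`.
    Translation stops at the first stop codon; a trailing partial codon
    is ignored. (Hypothetical helper, for illustration only.)
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], unknown)
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCAAATGA"))
    print(translate("ATGNNNGGG"))
```

Writing unresolved codons as a placeholder rather than dropping them keeps the translated sequence in register with the reference, which matters when sequences are later aligned.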

Three reviewers reported positively about the development of this program, noting its importance in advancing the application of palaeoproteomics more broadly in paleontology.

References

Cappellini, E., Welker, F., Pandolfi, L., Ramos-Madrigal, J., Samodova, D., Rüther, P. L., Fotakis, A. K., Lyon, D., Moreno-Mayar, J. V., Bukhsianidze, M., Rakownikow Jersie-Christensen, R., Mackie, M., Ginolhac, A., Ferring, R., Tappen, M., Palkopoulou, E., Dickinson, M. R., Stafford, T. W., Chan, Y. L., … Willerslev, E. (2019). Early Pleistocene enamel proteome from Dmanisi resolves Stephanorhinus phylogeny. Nature, 574(7776), 103–107. https://doi.org/10.1038/s41586-019-1555-y

Douka, K., Brown, S., Higham, T., Pääbo, S., Derevianko, A., and Shunkov, M. (2019). FINDER project: Collagen fingerprinting (ZooMS) for the identification of new human fossils. Antiquity, 93(367), e1. https://doi.org/10.15184/aqy.2019.3

Lugli, F., Nava, A., Sorrentino, R., Vazzana, A., Bortolini, E., Oxilia, G., Silvestrini, S., Nannini, N., Bondioli, L., Fewlass, H., Talamo, S., Bard, E., Mancini, L., Müller, W., Romandini, M., and Benazzi, S. (2022). Tracing the mobility of a Late Epigravettian (~ 13 ka) male infant from Grotte di Pradis (Northeastern Italian Prealps) at high-temporal resolution. Scientific Reports, 12(1), 8104. https://doi.org/10.1038/s41598-022-12193-6

Ostrom, P. H., Schall, M., Gandhi, H., Shen, T.-L., Hauschka, P. V., Strahler, J. R., and Gage, D. A. (2000). New strategies for characterizing ancient proteins using matrix-assisted laser desorption ionization mass spectrometry. Geochimica et Cosmochimica Acta, 64(6), 1043–1050. https://doi.org/10.1016/S0016-7037(99)00381-6

Patramanis, I., Ramos-Madrigal, J., Cappellini, E., and Racimo, F. (2022). Hominid Palaeoproteomic Reference Dataset (1.0.1) [dataset]. Zenodo. https://doi.org/10.5281/ZENODO.7333226

Patramanis, I., Ramos-Madrigal, J., Cappellini, E., and Racimo, F. (2023). PaleoProPhyler: A reproducible pipeline for phylogenetic inference using ancient proteins. BioRxiv, 519721, ver. 3 peer-reviewed by PCI Paleo. https://doi.org/10.1101/2022.12.12.519721

Schroeter, E. R., Cleland, T. P., and Schweitzer, M. H. (2022). Deep Time Paleoproteomics: Looking Forward. Journal of Proteome Research, 21(1), 9–19. https://doi.org/10.1021/acs.jproteome.1c00755

Shapiro, B., Barlow, A., Heintzman, P. D., Hofreiter, M., Paijmans, J. L. A., and Soares, A. E. R. (Eds.). (2019). Ancient DNA: Methods and Protocols (2nd ed., Vol. 1963). Humana, New York. https://doi.org/10.1007/978-1-4939-9176-1

Warinner, C., Korzow Richter, K., and Collins, M. J. (2022). Paleoproteomics. Chemical Reviews, 122(16), 13401–13446. https://doi.org/10.1021/acs.chemrev.1c00703

Welker, F., Ramos-Madrigal, J., Gutenbrunner, P., Mackie, M., Tiwary, S., Rakownikow Jersie-Christensen, R., Chiva, C., Dickinson, M. R., Kuhlwilm, M., De Manuel, M., Gelabert, P., Martinón-Torres, M., Margvelashvili, A., Arsuaga, J. L., Carbonell, E., Marques-Bonet, T., Penkman, K., Sabidó, E., Cox, J., … Cappellini, E. (2020). The dental proteome of Homo antecessor. Nature, 580(7802), 235–238. https://doi.org/10.1038/s41586-020-2153-8

Welker, F., Ramos-Madrigal, J., Kuhlwilm, M., Liao, W., Gutenbrunner, P., De Manuel, M., Samodova, D., Mackie, M., Allentoft, M. E., Bacon, A.-M., Collins, M. J., Cox, J., Lalueza-Fox, C., Olsen, J. V., Demeter, F., Wang, W., Marques-Bonet, T., and Cappellini, E. (2019). Enamel proteome shows that Gigantopithecus was an early diverging pongine. Nature, 576(7786), 262–265. https://doi.org/10.1038/s41586-019-1728-8

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
The project was funded by the European Union’s EU Framework Programme for Research and Innovation Horizon 2020, under Grant Agreement No. 861389-PUSHH. F.R. was additionally supported by a Villum Young Investigator Grant (project no. 00025300), a COREX ERC Synergy grant (ID 951385) and a Novo Nordisk Fonden Data Science Ascending Investigator Award (NNF22OC0076816). E.C. was additionally supported by the European Research Council (ERC) through the ERC Advanced Grant “BACKWARD”, under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 101021361).

Reviews

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2022.12.12.519721

Version of the preprint: 1

Author's Reply, 25 Aug 2023

https://doi.org/10.24072/pci.paleo.100220.ar1

Decision by Leslea Hlusko, posted 17 Jul 2023, validated 17 Jul 2023

Thank you for your patience as we located reviewers for your manuscript and gave them time to read and implement the pipeline. We now have three reviews (2 anonymous and 1 signed) that are presented in the spirit of advancing science respectfully and thoughtfully. All three are very supportive of your development and public posting of PaleoProPhyler. While one reviewer ran into difficulty executing two of the three modules in the pipeline, this reviewer was encouraging of your approach. All three reviewers offer specific advice on how to improve your manuscript, including a more detailed description of the three modules. As you prepare your revision, please include a response to the reviewers. I look forward to reading your revision.

https://doi.org/10.24072/pci.paleo.100220.d1

Reviewed by anonymous reviewer 1, 24 Mar 2023

Patramanis et al. describe PaleoProPhyler, a pipeline to download, build, and analyze protein sequence databases for phylogenetics including with paleoproteomic sequences. This is an interesting workflow and will help standardize phylogenetics in paleoproteomics.

“files into amino acid seuqences” should be “files into amino acid sequences”

Description of the Pipeline: Please add detail/summary of each module here in the main manuscript. The supplementary has nice detail of each module, but it is very lacking here in the main manuscript.

Supplementary Choosing and preparing the list of proteins: \Reference_Protein_List.txt is not present in the github repository

Supplementary Final Execution: This section feels incomplete. I’m not sure what is needed, but more detail is probably helpful.

https://doi.org/10.24072/pci.paleo.100220.rev11

Reviewed by Katerina Douka, 20 Apr 2023

This is an exciting development in the field of palaeoproteomics and one that the community will welcome. I recommend the manuscript for publication and include below my comments and some minor corrections/additions.

-------

1/ To make the manuscript appear more informed, I would add in the first paragraph that while shotgun proteomics is used to infer phylogenetic relationships, another palaeoproteomics method (PMF, or ZooMS when applied to collagen) is used as a primary tool for identifying new hominid remains, which can then be analysed more deeply with shotgun proteomics, ultimately using the new bioinformatics tool presented here.

If so, I would add a few more references aside from the Copenhagen group. We are talking about democratisation of the field, and citing more widely is part of it too. I believe the oldest collagen analysed so far is presented in Rybczynski et al. (2013), more recently expanded (Buckley et al. 2020, cited already), and other teams have also published very ancient proteins (e.g., Nielsen-Marsh et al. 2009 (https://www.sciencedirect.com/science/article/pii/S0305440309001253) or Brown et al. 2022 (https://www.nature.com/articles/s41559-021-01581-2)).

2/ Page 2. "The amount of publicly available proteome sequences is much smaller in comparison": Can you quantify this? There are indeed very few.

3/ For Module 3, I would have appreciated comments on thresholds or limitations for the use of PaleoProPhyler. Are there any? What are the limitations imposed by the (often) small number and poor preservation of proteins/peptides for a given sample? Are there cut-offs, and suggestions on how to overcome them?

4/ There is a mention of Supplementary Material, but I could not see it or access it.

5/ Unless there is a very specific word limitation, there is very little in the description of how the pipeline works and even what each Module does. I like the graphical abstract but I was left wondering where the input and output are and, as already mentioned, wanting some indication of cut-offs and of general data hygiene.

Some minor stuff:

“…lab-generated protein data does not even exist” : Remove even
“…absence of knowledge about even a single amino acid polymorphism”: Remove even
“The modules are intended to synergize with each other” : I am not sure of the word synergize here. Maybe best to keep it simple and say “work with each other”

https://doi.org/10.24072/pci.paleo.100220.rev12

Reviewed by anonymous reviewer 2, 17 Jul 2023

The manuscript ‘PaleoProPhyler: a reproducible pipeline for phylogenetic inference using ancient proteins’ by Patramanis and colleagues presents an open-source pipeline for the phylogenetic analysis of palaeoproteomic data. The pipeline is split into three modules which follow on from each other, but can be run independently. These build a basic reference database from proteomes available on Ensembl (module 1), transcribe published genomes to supplement the reference database (module 2), and perform phylogenetic analysis of proteomic data using the reference database (module 3). The motivation for the development of the pipeline and a brief overview are provided in the main text with a more detailed explanation of the workflow presented in the supplementary information and the code available on the github of the lead author. A tutorial is provided to train users in how to install and run the pipeline using published data to reconstruct the enamel phylogeny of two hominids, Homo antecessor (Welker et al 2020) and Gigantopithecus blacki (Welker et al 2019). The authors used modules 1 and 2 of the pipeline to curate a hominid palaeoproteomics reference database which they make publicly available on Zenodo.

Open-source tools for reproducible data processing and analysis between different research groups and labs are important areas of development for the field of palaeoproteomics, as they are currently lacking. This hinders data reproducibility and represents a barrier to researchers within the field who lack formal training in computational biology. The PaleoProPhyler pipeline presented by the authors addresses this issue and therefore has the potential to be a timely and important addition to the toolset available to the palaeoproteomics community. The rationale for the work is clear and the manuscript is well written. The modularity of the pipeline is highly useful and will enable users to adopt portions of the pipeline for their own uses. The tutorial written with a ‘non-bioinformatics-background audience in mind’ is an excellent resource to increase accessibility to a wide range of researchers and achieve the aim of improving reproducibility within the field.

I am not a bioinformatician so will not comment on the scripts themselves but will comment from the point of view of the ‘non-bioinformatics’ audience, the target audience of the tutorial. Unfortunately, I was only able to run the first module of the pipeline when following the tutorial, whilst Modules 2 and 3 resulted in errors and termination of the script. Perhaps readers with bioinformatics training would be able to adapt the scripts to make them run but even with access to server and bioinformatics support I was unable to complete the tutorial. Therefore, to be widely employed by researchers with different computational setups, some revisions to minimise dependencies and the potential for clashes between systems would be beneficial.

The tutorial first directs the user to download the github workflow and install the conda environment from the command line and then download the published fasta files of the two hominid proteomes. As noted by the authors, the user ideally needs access to a high performance lab server for sufficient computational power to run the pipeline. The installation of the conda environments and pipeline may clash with pre-installed software on the institutional server which the user has no access to modify. This may act as a barrier to the installation of the pipeline.

The first module generates a scaffold reference database by downloading proteomes from species closely related to the hominids from Ensembl, a publicly available database for annotated genomic data. The second module is designed to supplement the scaffold reference database through the transcription of published genomic data, including other ancient hominins.

Running the first module was relatively quick and straightforward. Some further information or references could be added on the strengths/weaknesses of downloading reference proteomes from Ensembl vs translating genomes. I was unable to run the second module so cannot comment on the output.

The third module merges the palaeoproteomic data with the reference datasets and performs phylogenetic analysis. Implementing the module seems very straightforward; however, the tutorial ends abruptly after the analysis has been run, with no further information on where the output files are generated. The tutorial could be improved by adding information here on how to check the output of the analysis (as the authors did at the end of module 1), how to visualise the generated trees, and some simple QC checks to carry out.
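As an illustration of the kind of simple QC check meant here, the sketch below reports, for each sequence in an aligned FASTA file, the fraction of alignment columns that carry an actual amino acid rather than a gap or missing-data symbol. This is a hypothetical example, not part of PaleoProPhyler: the helper names `read_fasta` and `coverage`, and the choice of `-`, `?` and `X` as missing-data characters, are assumptions.

```python
# Characters treated as "no data" in an aligned sequence (an assumption;
# adjust to whatever convention the alignment in hand actually uses).
MISSING = set("-?X")


def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {sequence name: sequence}."""
    chunks: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def coverage(aligned: dict[str, str]) -> dict[str, float]:
    """Return the fraction of non-missing positions for each aligned sequence."""
    if len({len(seq) for seq in aligned.values()}) > 1:
        raise ValueError("sequences differ in length; input is not an alignment")
    return {
        name: sum(c.upper() not in MISSING for c in seq) / len(seq) if seq else 0.0
        for name, seq in aligned.items()
    }


if __name__ == "__main__":
    example = ">reference\nMAKLV\n>ancient_sample\nM-K??\n"
    for name, fraction in coverage(read_fasta(example)).items():
        print(f"{name}\t{fraction:.2f}")
```

A sample with very low coverage contributes few informative sites, so its placement in a tree deserves extra caution; any cut-off value would need to be chosen and justified by the user.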

Although the pipeline may run successfully on the authors’ institutional server, it needs to be packaged more efficiently for widespread use. There appear to be some typos in the code or system incompatibilities which prevent the pipeline from running to completion. It would require a bioinformatician to troubleshoot the errors. This is therefore a barrier to anyone without this knowledge base.

This is a common problem when sending code between labs and can require some complicated trial and error to solve. I suggest packaging the software into a container so it can be shared between labs without issues of installation in clashing systems. Enlisting several researchers from different labs outside the Globe Institute to install and run the pipeline tutorial on their own servers would provide the authors with the opportunity to trouble-shoot any issues that arise.

Other points:

The system requirements for running the pipeline on a linux OS are not apparent until the SI and tutorial - this could be mentioned in the main text under ‘Availability and Community Guidelines’.
The hominid reference database will be highly useful. Although the references for the data are available in the SI, a table with all of the individuals included would be useful.
Overall the authors have done a good job adding useful tips, warnings, additional descriptions and links to resources to help users who are new to this type of analysis. Perhaps a text box with a glossary/key terms to provide additional descriptions of the different file types (FASTA, VCF, BAM, CRAM) could be useful for a non-bioinformatics audience, as there are lots of abbreviations used.
Ref 61 in the first paragraph of the Statement of Need appears to have no link.
There are some typos throughout the tutorial text so some proofreading would be beneficial.

https://doi.org/10.24072/pci.paleo.100220.rev13