The Málaga Corpus of Early Modern English Scientific Prose (MCEMESP)

This corpus contains hitherto unedited medical manuscripts produced in the period 1500-1700. The ultimate aim of the research team is to produce a corpus that is balanced across the categories of early English medical writing, i.e. theoretical treatises, surgical treatises and remedies.

Corpus Files

The corpus has been compiled in three stages. Each stage has produced a different version of the corpus, and each version serves a different research purpose:

  1. Plain text corpus (.txt): these files contain the semi-diplomatic transcription of the treatises included in the corpus, where original spelling and word division have been preserved. Additionally, these transcriptions are available, together with the digitised images of the original manuscripts, at the project's website.

    Take thyme seedes rosemary seedes parsley seede the middle rinde of the walnut tree gromell seede saxifrage seede the kernells of the eglantine berryes

  2. Normalised corpus (.norm): these files contain the normalised transcriptions of the treatises included in the corpus. The normalisation has been carried out with VARD, which standardises variant forms to Present-Day English and wraps each one in an XML tag whose orig attribute records the original word, so that it can still be consulted.

    Take thyme <normalised orig="seedes" auto="false">seeds</normalised> rosemary <normalised orig="seedes" auto="false">seeds</normalised> parsley <normalised orig="seede" auto="false">seed</normalised> the middle <normalised orig="rinde" auto="false">rind</normalised> of the walnut tree <normalised orig="gromell" auto="false">gromwell</normalised> <normalised orig="seede" auto="false">seed</normalised> <normalised orig="saxifrage" auto="false">saxifrage</normalised> <normalised orig="seede" auto="false">seed</normalised> the <normalised orig="kernells" auto="false">kernels</normalised> of the eglantine <normalised orig="berryes" auto="false">berries</normalised>

  3. POS-tagged corpus (.pos): these files contain the POS-tagged version of the normalised corpus. The tagging has been carried out with CLAWS, which assigns a morpho-syntactic tag to each word in the corpus, punctuation marks included. The C7 tagset has been employed.

    Take_VV0 thyme_NN1 seeds_NN2 rosemary_NN1 seeds_NN2 parsley_NN1 seed_NN1 the_AT middle_JJ rind_NN1 of_IO the_AT walnut_NN1 tree_NN1 gromwell_NN1 seed_NN1 saxifrage_NN1 seed_NN1 the_AT kernels_NN2 of_IO the_AT eglantine_JJ berries_NN2
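The word_TAG layout shown in the sample above can be split into word and tag pairs with a few lines of code. The following is a minimal Python sketch, assuming that tokens are separated by whitespace and that the tag follows the last underscore in each token, as in the sample line. The function name parse_pos_line is hypothetical and is not part of the corpus distribution.

```python
def parse_pos_line(line):
    """Split a line of word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST underscore, so a word (or punctuation mark)
        # that itself contains an underscore is kept intact.
        word, sep, tag = token.rpartition("_")
        if not sep:
            # Assumption: a token without an underscore is kept untagged.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


# Part of the sample line from the .pos file shown above.
sample = ("Take_VV0 thyme_NN1 seeds_NN2 rosemary_NN1 seeds_NN2 "
          "parsley_NN1 seed_NN1")

pairs = parse_pos_line(sample)

# Example query: collect every word tagged NN2 in the sample.
plural_nouns = [word for word, tag in pairs if tag == "NN2"]
```

Filtering the resulting pairs by tag, as in the last line, is one simple way to retrieve all words of a given class from a .pos file.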

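Please note that the normalised (.norm) and POS-tagged (.pos) files should not be used to quote original examples from the corpus, as they contain normalised versions of the texts. Original examples should be quoted from the plain text (.txt) files.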
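Because each normalised tag in a .norm file keeps the original word in its orig attribute, the pairs of variant and standard forms can be listed programmatically. The following is a minimal Python sketch, assuming that every tag follows the layout of the sample above (orig as the first attribute, double-quoted values, no nested tags). The names NORM_TAG, variant_pairs, restore_original and strip_tags are hypothetical; for files that depart from this layout, a full XML parser would be a safer choice. The .txt files remain the source to quote from.

```python
import re

# Matches one <normalised orig="..." ...>...</normalised> element.
# Assumption: orig is the first attribute, as in the sample above.
NORM_TAG = re.compile(r'<normalised orig="([^"]*)"[^>]*>(.*?)</normalised>')


def variant_pairs(text):
    """Return (original, normalised) pairs for every tagged word."""
    return NORM_TAG.findall(text)


def restore_original(text):
    """Replace each tagged word with its original spelling."""
    return NORM_TAG.sub(lambda m: m.group(1), text)


def strip_tags(text):
    """Keep the normalised spelling and drop the XML tags."""
    return NORM_TAG.sub(lambda m: m.group(2), text)


# Shortened version of the sample line from the .norm file.
sample = ('Take thyme <normalised orig="seedes" auto="false">seeds</normalised> '
          'parsley <normalised orig="seede" auto="false">seed</normalised>')

pairs = variant_pairs(sample)
original = restore_original(sample)
normalised = strip_tags(sample)
```

Here variant_pairs yields the list of spelling variants with their standard forms, while restore_original and strip_tags produce tag-free text in the original and the normalised spelling respectively.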



For further enquiries, please contact:

How to cite

Calle-Martín, Javier et al. 2016. The Malaga Corpus of Early Modern English Scientific Prose (MCEMESP). Málaga: University of Málaga. Available from http://modernmss.uma.es.