From CEEMID to Reprex

Historically, CEEMID started out as the Central and Eastern European Music Industry Databases, created out of necessity following a CISAC Good Governance Seminar for European Societies in 2013. The adoption of European single market and copyright rules, and the increased activity of competition authorities and regulators, required a more structured approach to setting collective royalty and compensation tariffs in a region that was traditionally regarded as data-poor, with fewer industry and government data sources available.

In 2014 three societies, Artisjus, HDS and SOZA, realized that they needed to make further efforts to modernize the way they measure their own economic impact and the economic value of their licenses, in order to remain competitive in advocating their interests vis-à-vis domestic governments, international organizations like CISAC and GESAC, and the European Union. They signed a Memorandum of Understanding with their consultant to set up the CEEMID databases and to harmonize their efforts.

  • The first Hungarian Music Industry Report, a 144-page business strategy and policy advocacy report, which became the basis of annual reports in the Hungarian music industry.

  • The first Slovak Music Industry Report, a 227-page advocacy report with business strategy and evidence-based policy recommendations. Slovak stakeholders also commissioned several royalty pricing studies and other fact-based industry analyses, which are not publicly available.

  • Private Copying in Croatia is an advocacy report for re-setting the remuneration of private copying, and measuring the value transfer to media platforms such as YouTube. In Hungary, more technical and detailed reports were made for Artisjus, Mahasz, EJI, Hungart and Filmjus, which are not available to the public.

  • CEEMID was used in various quantitative ex ante granting assessments, in royalty price setting, in calculating private copying remuneration, predicting audiences, and other evidence-based policy projects.

  • See the CEEMID Documentation Wiki for more information about data coverage and methodology.


Reprex B.V. is a reproducible research company that aims to put CEEMID’s intellectual property (its data mapping, open data, automated research, and its 2000 cultural and creative sector indicators) on a sustainable business model. We believe that CEEMID created a globally unique data program, but it had too few users, and its (private) funding was too scarce and too ad hoc, which made a great product financially unsustainable.

We believe that whenever a business or policy consulting team, a research institute, or a data journalism team has used, formatted, and analyzed data from an external source at least twice, the procedure should be automated. Automation makes the procedure far less error-prone, well documented, cheap and reusable. Furthermore, making data collection ongoing instead of ad hoc saves data acquisition, validation and supervision costs. We would like to help medium-sized business, policy, NGO, scientific and data journalism organizations that do not have the institutional capacity to hire data scientists and engineers.

We made critical elements of our software product peer-reviewed open source statistical software, and on the basis of these elements we created a minimum viable product.

Our minimum viable product comes in two forms, data-as-service and solution-as-service. We offered it to three customer segments with similar research problems: business and policy consultancies, university research institutes, and data journalism teams. We have not received a single refusal, and several of our prospects immediately referred us further.

Our prospects have committed, in contracts or letters of intent, to incorporate our product into planned research activities and some subscription products, and some of them have already assigned budgets and resources to these projects. Based on our team’s experience in these segments, the problems and workflows that we support are very similar, but the business and funding models of the three segments are very different. We create value by continuous automation, which has a different cost structure than our customers’ project-based or grant-based ad hoc funding, and we are working on a cooperation model that bridges these differences to deliver the highest value.

Data sources

Our data program grew out of a collaborative observatory, CEEMID. CEEMID aims to transfer thousands of indicators, and the verifiable, open-source software that creates them, to the European Music Observatory, to give the music industry, policymakers and music professionals Europe-wide access to timely, reliable, actionable statistics and indicators. (Read more about our data coverage)

Reprex is aiming to support this transition, and at the same time, create new data products for other creative industries. See our call for partners.

  1. Open data: In the EU, open data is governed by the Directive on open data and the re-use of public sector information, in short the Open Data Directive (EU) 2019/1024. It entered into force on 16 July 2019. It replaces the Public Sector Information Directive, also known as the PSI Directive, which dated from 2003 and was amended in 2013. The founder of CEEMID, Daniel Antal, has been involved in Open Data and PSI since 2008. Open data is usually raw data, and it requires significant statistical know-how and effort to process it into useful statistical or key performance indicators.

  2. Shared resources: We try to encourage the industry to share data resources and to exploit them better for a whole range of professional activities, such as advocacy, collection monitoring, and better pricing. We treat our own survey program as a shared resource: our partners can distribute these online surveys at almost no cost and gather very important, pooled, comparable data assets.

  3. Industry data resources: The music industry is dominated by freelancers and microenterprises. In most countries they do not participate in statistical reporting and submit only simplified tax returns and create simplified annual reports. This means that the music industry, and creative & cultural industries are not well represented in official statistics, and remain rather invisible for both business and policymakers.

In our view, there is no alternative to collecting the missing data on the music industry’s own initiative. CISAC on the authors’ and publishers’ side, and IFPI on the recording side, collect plenty of important market data, and Live DMA has a fledgling data program for some aspects of European live music. Most of these data sources are not public, but our private data integration allows their users to get far more out of them, because we can add critically missing information that is needed to use them for royalty valuation or market forecasting.

Our own survey program is designed to collect the information that is missing from these industry data collections and also from the EU statistical frameworks.

Music Professional Surveys

Our music professional surveys are designed to collect information that is not available in other music industry sources. Because the music industry is dominated by freelancers and microenterprises, our surveys target three groups: performing artists (recorded and live), technicians (recording and live), and managers (again, recorded and live).

CEEMID uses Music Professional Surveys to understand working conditions, skills, remuneration, the concert economy, and granting. (See: who are the music professionals?) Statistical data, such as unemployment, average wage or GDP data, is mainly produced by the national statistical authorities from statistical reporting, anonymized tax return data and mandatory financial statement data.

In live music, there are very few comprehensive data sources in Europe, and they are very hard to access and integrate. Live DMA collects data in order to analyze the situation of live music venues and clubs in Europe. Their survey program is excellent, but it only covers some aspects of live music, and only in a small number of countries, which so far have not overlapped much with CEEMID’s. The two surveys complement each other greatly, because we survey the individuals who use the venues for their projects, while they ask the venue managers. We ask about the costs and revenues of the projects that are hosted in venues, and they ask the venues themselves. For users of the Live DMA survey, our data fills in many gaps. In European countries where Live DMA has not yet set foot, our surveys may be the only data source on the live music economy, the part of the music business that created the most income in the 2010s and which was shut down by the pandemic.

Our survey complements IFPI’s data resources on the recording industry. IFPI collects data from the labels via its national affiliates. Our research shows that this usually leaves out a small self-published segment, which is not significant in market value, but significant in recording output. Our survey also asks questions further upstream in the value chain, for example about recording costs, which are not visible to labels. Most recordings in the 21st century are no longer financed by labels, but by bands and increasingly, at least in Europe, by publishers, because in some markets the publishing side has a significantly higher value than the recording side. For IFPI-affiliated users, our surveys add more detail and increase the usefulness of the confidential datasets that IFPI itself collects for marketing and pricing.

Our surveys complement CISAC’s own data collection program even more. After the CISAC v Commission case (InfoCuria 2013), the international organization of authors’ societies stopped collecting and disseminating any pricing-related information. In our view, this was an overly cautious reaction to a competition authority intervention, and it put the smaller member societies, which do not have research departments, into an especially difficult position. It also started a very dangerous process, in which the societies, without actual valuations, try to set their prices to each other’s averages. Because prices are only regulated by competition law, which can only adjust prices downwards, this leads to a race to the bottom. Our surveys (and other CEEMID data sources) were directly aimed at helping CISAC’s members (and not CISAC) with valuations.

Grant managers often have a duty to collect survey data for ex post grant evaluation. We give a general framework mainly for ex ante grant evaluation (i.e. making sure that a forthcoming grant call is suitable for the target audience) but we also collect high-level data that may be used as a starting point of ex post grant evaluation.

In our view, the greatest challenge of the music industry is the lack of any strategic HR function in most of the industry. This means that music professionals hardly have access to life-long learning. Before the pandemic, the markets we analyzed were plagued by structural unemployment: many people complained that they did not get enough paid jobs, while employers complained that they could not find reliably skilled freelancers or employees. The music industry is a high value-added industry that mainly adds value to the economy via its labour force. We try to collect data that may help the industry to better manage the education and vocational training of its workforce.

Figure 6.1: Taken from the Central European Music Industry Report

And last, but not least, our surveys target individuals, which allows us to understand the structural changes in the industry. Before the COVID-19 pandemic we realized that in some markets, the publishing side had become so much more valuable than the recording side that publishers started to invest in sound recordings. Obviously, these trends are not well captured by either the recording industry’s or the publishing side’s very different data sources. Our approach gives the live music, recording industry, publishing and grant managers a comprehensive view that also shows how their roles are defined in relation to each other. (See Chapter 3 The Creation of Music in our Central European Music Industry Report for an example.)

Cultural Access & Participation Surveys

Our surveys collect data from the supply side of the industry, while audience surveys collect data from the demand side. When the concert or recording market is in balance, some information can be analyzed by collecting data from only the supply or only the demand side. We still cross-check where we can: for example, even though the price paid and the price received should be the same, we query payers as well as payees.

In 2012, the ESSnet Culture working group of Eurostat and many national statistical offices reviewed the best practices in survey design and survey harmonization. Standardized CAP surveys allow comparison with international surveys, such as the CAP surveys carried out by the European Union and by other countries, and they allow the use of standardized evaluation for survey results.

The standardized quantitative surveys of content use are ‘Cultural Access & Participation’ (CAP) surveys, which were standardized by the European Statistical System Network’s Culture working group. The Final Report of the Working Group European Statistical System Network on Culture (in short: ESSnet-Culture) (Bína et al. 2012) contains a rather detailed guideline in the report of the Task Force on Cultural Practices and Social Aspects of Culture (in general: pp. 236-242; survey methodology and harmonization: pp. 242-255; recommendations: pp. 273-274; an extensive annex: pp. 397-417). Together with the Cultural participation (cult_pcs) Reference Metadata in Euro SDMX Metadata Structure (ESMS), it provides a very mature social scientific model to measure participation, as well as the survey methodology and sample descriptions needed to carry out such surveys. CEEMID has so far carried out 7 detailed, nationally representative CAP surveys for music and audiovisual use, which it retrospectively harmonized with the EU 2007 and EU 2013 surveys (see the Annex on survey harmonization).

Survey harmonization

The top-level, basic questions were standardized by the ESSnet Culture working group. They are based on the ICET surveying model, which in turn builds on a history of quantitative surveying of entertainment industry audiences since the 1970s. The ICET model itself was first designed in the Netherlands about 20 years ago for a better measurement of the then increasingly digital forms of cultural participation (Haan and Adolfsen 2008; Haan and Broek 2012).

Our retroharmonize R package (Antal 2020c) was built to produce new, unpublished statistical indicators from pan-European surveys by finding and re-coding the same information in the transcripts of the interviews themselves (in the microdata).
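To illustrate what re-coding the same information across surveys means in practice, here is a minimal Python sketch. It is not the retroharmonize API: the wave names, the answer labels and the `HARMONIZED_LABELS` crosswalk are all hypothetical, and they only show the idea of mapping differently worded answers to one common coding before pooling the microdata.

```python
# Hypothetical crosswalk: each survey wave words the same answer differently.
HARMONIZED_LABELS = {
    "yes": "visited",
    "1-2 times": "visited",
    "no": "not_visited",
    "never": "not_visited",
    "dk": "missing",
    "refusal": "missing",
}


def harmonize_answer(raw_label):
    """Map a raw answer label to the common coding; None flags an unmapped label."""
    return HARMONIZED_LABELS.get(raw_label.strip().lower())


def harmonize_wave(wave_name, responses):
    """Return one record per respondent, with the harmonized answer."""
    return [
        {"wave": wave_name, "respondent": rid, "concert_visit": harmonize_answer(raw)}
        for rid, raw in responses
    ]


# Two hypothetical waves that asked the same question with different answer scales.
wave_a = harmonize_wave("wave_a", [(1, "Yes"), (2, "No"), (3, "DK")])
wave_b = harmonize_wave("wave_b", [(1, "1-2 times"), (2, "Never"), (3, "Refusal")])
pooled = wave_a + wave_b

# Once the coding is common, an indicator can be computed on the pooled data.
share_visited = sum(r["concert_visit"] == "visited" for r in pooled) / len(pooled)
print(f"Share of respondents who visited a concert: {share_visited:.2f}")
```

Returning `None` for unmapped labels, instead of guessing, keeps unexpected answer categories visible, so that an analyst can extend the crosswalk deliberately.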

Other data resources

Data integration

From the originally envisioned, centralized, permission-based data structure, CEEMID switched, due to practical considerations, to a more flexible, decentralized approach. This approach is based on continuous data integration, which requires permission to use business confidential information only at the point of use. This allowed a rapid extension of CEEMID to the whole of Europe and even beyond. As a result of continuous data integration, it already includes hundreds of indicators foreseen in all pillars of the planned European Music Observatory.

While CEEMID is aware of and uses the metadata of CISAC’s, IFPI’s, EAO’s, and other industry sources’ data, it does not contain this data. Such data is only accessed when a user with permission to use these industry sources requires the integration of such data with other CEEMID data, or with user-specific data. While this approach makes sharing results more cumbersome, it provided a path to increase the number of useful indicators from a few dozen to around a thousand. Furthermore, it greatly increases the value of CISAC’s, IFPI’s or EAO’s data, especially when designing better royalty rates, or creating economic evidence for litigation. Take a look at a simple, non-confidential example blog post.

We will integrate data into open data products and music industry intelligence apps from the following sources:

  • Nationally representative CAP surveys of music users and film viewers.

  • Anonymous CEEMID Music Professional Surveys and CEEMID Audiovisual Professional Surveys about their work, incomes and costs. See example blog post.

  • Big data sources from various geolocational applications about events and location visits. See the small video.

  • Automatic data retrieval from open data sources, including statistical data and EU-funded research. See example blog post.


Our reproducible research workflow is based on the statistical programming language R (R Core Team 2020). R is an open source implementation of the statistical programming language S, and it is widely used in national statistical offices (Templ and Todorov 2016); we believe that it is the 21st century lingua franca of statistics. It is a very high-level, interpreted language that is very easy to use and modify, and even single lines of code can be executed on their own. In other words, it is very well suited for literate programming, i.e. human-readable program code that helps peer review.

As members of the rOpenGov initiative, we actively create and contribute to various open-source “packages”, or software libraries, that allow reproducible access to open data. The eurostat package (Lahti et al. 2020, 2017) allows API access to, and basic processing of, the Eurostat data warehouse. Because of Eurostat’s problematic regional statistics, we amended it with further software that became the regions package (Antal 2020b). For the use of symmetric input-output tables in economic impact assessments we created the iotables package (Antal 2020a), because these data resources cannot be used without further processing after they are retrieved from the Eurostat warehouse.

In order to create data products that can be easily used on any personal computer, in a spreadsheet application or statistical software, or inserted into a relational database, our data must comply with the statistical definition of tidy data (Wickham 2014). Our indicators usually go through the tidyverse software packages dplyr: A Grammar of Data Manipulation and tidyr: Tidy Messy Data (Wickham et al. 2020; Wickham 2020) and the accompanying purrr: Functional Programming Tools (Henry and Wickham 2020). For the analyst, this brings R very close to SQL, to the point that you can write mixed R/SQL scripts.
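As an illustration of the tidy data principle (one observation per row, one variable per column), the following Python sketch reshapes a "wide" table into long form. The table, its column names and its values are made up for the example, and the helper function is hypothetical; in an R workflow this step would typically be done with tidyr.

```python
# A hypothetical "wide" table: one row per country, one column per year.
wide_rows = [
    {"geo": "HU", "2018": 10.5, "2019": 11.0},
    {"geo": "SK", "2018": 7.2, "2019": None},  # None marks a missing value
]


def pivot_longer(rows, id_column, names_to="time", values_to="value"):
    """Reshape wide rows so that each output row holds exactly one observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tidy.append(
                {id_column: row[id_column], names_to: int(column), values_to: value}
            )
    return tidy


tidy_rows = pivot_longer(wide_rows, id_column="geo")
for record in tidy_rows:
    print(record)
```

In the long form, the year is an ordinary variable rather than a column name, so the same rows can be loaded into a spreadsheet, a statistical package or a relational database table without further reshaping.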

For the reproducible creation of this data catalogue we use bookdown (Xie 2020a), a book-form dynamic reporting tool based on knitr (Xie 2020b, 2015). They are used together with rmarkdown (Allaire et al. 2020; Xie, Allaire, and Grolemund 2018), which allows the combination of marked-up text for various outputs, such as PDF, HTML, e-books and Word documents, with program code in the R, Python and C++ languages. We also welcome contributions in R, Python or C++.

The use of open source software and the open source R statistical language allows a continuous peer-review of data ingestion, processing, corrections and indicator creation by statisticians, data scientists and academics. For example, this allowed us to compare test results on calculating economic impact indicators for the creative industries and other industries with the UK statistical office.

Data processing

  • iotables is a reproducible research tool that is able to work with national accounts and create some satellite accounts for all EU member states. It was originally developed to calculate the economic impacts of the Hungarian tax shelter before its renewal (state aid notification at DG Competition) and for the Slovak Music Industry Report, which used a similar methodology to prove that CCS sectors are overtaxed in the country. The iotables open source statistical software library is used by about 800 practitioners in the world.

  • regions solves the problem that Europe’s regional boundaries have changed in several thousand places over the last 20 years, and therefore the regional statistics of member states and Eurostat are not comparable for more than 2-3 years. This software validates and, where possible, changes the regional coding from NUTS1999 to the not yet used NUTS2021. It was originally designed in a research project at IVIR in the University of Amsterdam to understand the geographical dynamics of book piracy. Because of the needs this software fills, it had 700 users in the first month after publication.

  • retroharmonize is software that allows the programmatic retrospective harmonization of surveys, such as the last 35 years of all Eurobarometer microdata, or all Afrobarometer microdata. It grew out of the need of certain CEE member states to get comparable data about their music and audiovisual sectors. We commissioned surveys following ESSnet-Culture guidelines, and joined their data with open access European microdata-level surveys.
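The kind of calculation behind an economic impact assessment with symmetric input-output tables can be sketched with the standard Leontief model, in which the total output x needed to satisfy a final demand d solves x = (I - A)^(-1) d. The two-sector coefficients below are invented for illustration, and the Python code is a textbook sketch under that assumption, not the iotables implementation.

```python
def leontief_inverse_2x2(a):
    """Return (I - A)^-1 for a 2x2 technical coefficients matrix A."""
    m = [[1 - a[0][0], -a[0][1]],
         [-a[1][0], 1 - a[1][1]]]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("I - A is singular; the model has no unique solution")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


# Hypothetical coefficients: A[i][j] is the input from sector i that is
# needed to produce one unit of output in sector j.
A = [[0.1, 0.2],
     [0.3, 0.1]]
L = leontief_inverse_2x2(A)

# The output multiplier of sector j is the column sum of the Leontief inverse.
output_multipliers = [L[0][j] + L[1][j] for j in range(2)]

# Total (direct and indirect) output needed for 100 units of final demand
# in sector 0 and none in sector 1.
final_demand = [100.0, 0.0]
total_output = [
    sum(L[i][j] * final_demand[j] for j in range(2)) for i in range(2)
]

print("Output multipliers:", [round(v, 3) for v in output_multipliers])
print("Total output:", [round(v, 1) for v in total_output])
```

The multiplier of the first sector exceeds one because satisfying its final demand also requires intermediate inputs from both sectors; this indirect effect is what an impact assessment aims to quantify.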

These tools, and other similar tools that Daniel develops, allow testing the ideas and documenting them. These follow the concept of open collaboration (with other statistical, academic, national member state data sources) and reproducible research.

Various models

For modelling we use the tidymodels concept, which makes hundreds of R analytical libraries, and increasingly Python libraries, available via a unified API. Tidymodels allows us to pre-run tens of thousands of various models for the client, and to pre-select the most promising analytical tools for their problem.

Tidymodels is itself in an early stage of development. Learning tidymodels has enormous benefits for our analytical work, and if you want to be involved in our econometric and machine learning services, you have to use it.

It does not matter which packages and classes you use for the final models. If you are comfortable with data.table, you can use data.table. The tidyverse and tidymodels play their role in pre-processing and processing the various data resources for analysis.

How does reproducible research create value?

  • It improves work habits, and enhances the efficiency of analysts.

  • Increases teamwork, makes the integration and replacement of team members far easier.

  • Avoids duplication and multiplication of efforts. Dramatically reduces the time spent on data manipulation, debugging, error searching and formatting. This can save up to 80% of working time in analyst and consultant roles.

  • Relieves senior staff, as oversight is far more efficient when most errors are captured automatically.

  • Better suited for cumulative growth of data, information and knowledge. In the medium run it reduces data and information cost significantly, and over the long run it produces a very strong competitive edge.

  • Replication, reproducibility and the higher standards of confirmability and auditability are not only scientific standards; they are often also set by market regulators, professional standards and internal working guidelines.

  • Access to the growing body of open data in the EU (such as survey data, raw data used to calculate inflation, etc.). As raw data it is free or almost free, but it has large processing costs, because public bodies offer it on an as-is basis. It can replace costlier and often less valuable data acquisitions.

“Functions [in programming] allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste […] You should consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).” Hadley Wickham in R for Data Science.

We want to take this idea much further. We believe that any data table from Eurostat, the IMF, various industry sources or APIs that has been downloaded or acquired at least twice should arrive at the organization via a data ingestion application that automatically acquires every new edition of the data asset. Instead of re-formatting and adjusting for units, currencies and missing values each time, the application should always present the data asset in its best available form.
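A minimal sketch of such an ingestion step, in Python, might look like the following. The raw records, the `UNIT_MULTIPLIERS` table, the missing-value markers and the default exchange rate are hypothetical placeholders; a real application would fetch each new edition of the data asset from its source and apply documented conversion rates.

```python
# Hypothetical unit codes and the factor that converts each to single euros.
UNIT_MULTIPLIERS = {"EUR": 1, "THS_EUR": 1_000, "MIO_EUR": 1_000_000}
# Hypothetical markers that different sources use for missing values.
MISSING_MARKERS = {"", ":", "NA", "n/a"}


def clean_value(raw_value, unit, rate_to_eur=1.0):
    """Convert one raw cell to euros; return None when the cell is missing."""
    text = str(raw_value).strip()
    if text in MISSING_MARKERS:
        return None
    # Remove thousands separators, then apply the unit and currency factors.
    return float(text.replace(",", "")) * UNIT_MULTIPLIERS[unit] * rate_to_eur


def ingest(raw_records, rate_to_eur=1.0):
    """Return the edition in one consistent form: typed years, values in euros."""
    return [
        {
            "geo": record["geo"],
            "year": int(record["year"]),
            "value_eur": clean_value(record["value"], record["unit"], rate_to_eur),
        }
        for record in raw_records
    ]


# A hypothetical new edition of a data asset, as it might arrive from a source.
raw_edition = [
    {"geo": "HU", "year": "2019", "value": "1,250", "unit": "THS_EUR"},
    {"geo": "SK", "year": "2019", "value": ":", "unit": "MIO_EUR"},
    {"geo": "HR", "year": "2019", "value": "3.5", "unit": "MIO_EUR"},
]
clean_edition = ingest(raw_edition)
for row in clean_edition:
    print(row)
```

Because the cleaning rules live in one function, every new edition is treated in exactly the same way, and any correction to a rule applies to all past and future editions.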

We believe that every table, visualization and supervised model (a model that is not created by a machine but by your analysts) that has been produced at least twice should be produced by an application that re-creates all related tables, visualizations and model results whenever the data changes.
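One simple way to trigger such a re-run is to fingerprint the data and rebuild the outputs only when the fingerprint changes. The Python sketch below assumes this hash-based approach; the class, the function names and the toy summary table are hypothetical and stand in for the real tables, visualizations and models.

```python
import hashlib
import json


def fingerprint(records):
    """Stable SHA-256 digest of the data, independent of dictionary key order."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_summary_table(records):
    """Toy output: count and mean of the non-missing values."""
    values = [r["value"] for r in records if r["value"] is not None]
    mean = sum(values) / len(values) if values else None
    return {"n": len(values), "mean": mean}


class ReportPipeline:
    """Rebuilds its outputs only when the underlying data has changed."""

    def __init__(self):
        self.last_fingerprint = None
        self.outputs = None
        self.rebuild_count = 0

    def refresh(self, records):
        current = fingerprint(records)
        if current != self.last_fingerprint:
            self.outputs = build_summary_table(records)
            self.last_fingerprint = current
            self.rebuild_count += 1
        return self.outputs


pipeline = ReportPipeline()
data = [{"geo": "HU", "value": 10.0}, {"geo": "SK", "value": 14.0}]
pipeline.refresh(data)  # first run: outputs are built
pipeline.refresh(data)  # unchanged data: nothing is rebuilt
data.append({"geo": "HR", "value": 8.0})
pipeline.refresh(data)  # changed data: outputs are rebuilt
print(pipeline.rebuild_count, pipeline.outputs)
```

The same pattern extends to several outputs: each table, chart or model can be registered as a build step that runs whenever the fingerprint of its input changes.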

Other observatories

Observatories are created to permanently collect data and information, and to create a knowledge base for research and development, science, and evidence-based policymaking, usually by consortia of business, NGO, scientific and public bodies.

We are aiming to create similar observatories, but we are in no way affiliated with or connected to the following, existing observatories that we see as role models. Our mission is to serve similar observatories with research automation, making the observatory’s services less costly and more timely, with a higher level of quality control.


Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2020. Rmarkdown: Dynamic Documents for R.

Antal, Daniel. 2020a. iotables: Importing and Manipulating Symmetric Input-Output Tables.

Antal, Daniel. 2020b. “regions R package for sub-national boundary harmonization.”

Antal, Daniel. 2020c. “retroharmonize R package for ex post survey harmonization.”

Bína, Vladimir, Philippe Chantepie, Valérie Deboin, Kutt Kommel, Josef Kotynek, and Philippe Robin. 2012. “ESSnet-CULTURE, European Statistical System Network on Culture. Final Report.” Edited by Guy Frank.

Haan, Jos de, and Anna Adolfsen. 2008. De Virtuele Cultuurbezoeker - Publieke Belangstelling Voor Cultuurwebsites. SCP-Publicatie 2008/9. Den Haag, the Netherlands: Sociaal en Cultureel Planbureau.

Haan, Jos de, and Andries van den Broek. 2012. “Nowadays Cultural Participation - an Update of What to Look for and Where to Look for It.” In ESSnet-CULTURE, European Statistical System Network on Culture. Final Report., 397–417. Luxembourg.

Henry, Lionel, and Hadley Wickham. 2020. Purrr: Functional Programming Tools.

InfoCuria. 2013. “T-442/08 CISAC v Commission.”

Lahti, Leo, Janne Huovari, Markus Kainu, and Przemyslaw Biecek. 2017. “Eurostat R Package.” R Journal.

Lahti, Leo, Janne Huovari, Markus Kainu, and Przemyslaw Biecek. 2020. Eurostat: Tools for Eurostat Open Data.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Templ, Matthias, and Valentin Todorov. 2016. “The Software Environment R for Official Statistics and Survey Methodology.” Austrian Journal of Statistics 45 (March): 97–124.

Wickham, Hadley. 2020. Tidyr: Tidy Messy Data.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC.

Xie, Yihui. 2020a. Bookdown: Authoring Books and Technical Documents with R Markdown.

Xie, Yihui. 2020b. Knitr: A General-Purpose Package for Dynamic Report Generation in R.

Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC.