Title: | Convert Files to Parquet Format |
---|---|
Description: | Collection of functions to get files in parquet format. Parquet is a columnar storage file format <https://parquet.apache.org/>. The files to convert can be of several formats ("csv", "RData", "rds", "RSQLite", "json", "ndjson", "SAS", "SPSS"...). |
Authors: | Damien Dotta [aut, cre], Nicolas Chuche [aut] |
Maintainer: | Damien Dotta <[email protected]> |
License: | Apache License (>= 2.0) |
Version: | 0.5.7 |
Built: | 2024-11-21 05:51:42 UTC |
Source: | https://github.com/ddotta/parquetize |
This function checks whether a file/dataset is in valid parquet format. It prints the number of lines/columns and returns a tibble describing the columns.
check_parquet(path)
path |
path to the file or dataset |
This function will:
open the parquet dataset/file to check if it's valid
print the number of lines
print the number of columns
return a tibble with 2 columns:
the column name (string)
the arrow type (string)
You can find the list of arrow types in the arrow documentation.
A tibble with one row per column of the parquet dataset/file, giving the field name, the arrow type and whether it is nullable.
# check a parquet file
check_parquet(parquetize_example("iris.parquet"))

# check a parquet dataset
check_parquet(parquetize_example("iris_dataset"))
This function allows you to convert a csv or a txt file to parquet format.
Two conversion possibilities are offered:
Convert to a single parquet file. Argument path_to_parquet must then be used;
Convert to a partitioned parquet file. Additional arguments partition and partitioning must then be used.
csv_to_parquet(
  path_to_file,
  url_to_csv = lifecycle::deprecated(),
  csv_as_a_zip = lifecycle::deprecated(),
  filename_in_zip,
  path_to_parquet,
  columns = "all",
  compression = "snappy",
  compression_level = NULL,
  partition = "no",
  encoding = "UTF-8",
  read_delim_args = list(),
  ...
)
path_to_file |
String that indicates the path to the input file (don't forget the extension). |
url_to_csv |
DEPRECATED use path_to_file instead |
csv_as_a_zip |
DEPRECATED |
filename_in_zip |
Name of the csv/txt file in the zip. Required if several csv/txt files are included in the zip. |
path_to_parquet |
String that indicates the path to the directory where the parquet files will be stored. |
columns |
Character vector of columns to select from the input file (by default, all columns are selected). |
compression |
compression algorithm. Default "snappy". |
compression_level |
compression level. Meaning depends on compression algorithm. |
partition |
String ("yes" or "no", "no" by default) that indicates whether you want to create a partitioned parquet file.
If "yes", the partitioning argument must also be filled in. |
encoding |
String that indicates the character encoding for the input file. |
read_delim_args |
List of arguments passed to the underlying read_delim call used to read the input file. |
... |
additional format-specific arguments, see arrow::write_parquet() and arrow::write_dataset() for more information. |
A parquet file, invisibly
Be careful: if the zip size exceeds 4 GB, the function may truncate
the data (because unzip() does not work reliably in this case).
In this case, it is advised to unzip your csv/txt file by hand
(for example with 7-Zip)
and then use the function with the path_to_file argument.
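A minimal sketch of that manual workflow, assuming a hypothetical archive big_archive.zip (containing big_file.csv) and assuming the 7-Zip command-line tool 7z is available on your PATH:

# extract the oversized archive outside of R's unzip(), here via 7-Zip
system2("7z", args = c("x", "big_archive.zip", "-oextracted"))
# then convert the extracted csv as usual
csv_to_parquet(
  path_to_file = file.path("extracted", "big_file.csv"),
  path_to_parquet = tempfile(fileext = ".parquet")
)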
# Conversion from a local csv file to a single parquet file:
csv_to_parquet(
  path_to_file = parquetize_example("region_2022.csv"),
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Conversion from a local txt file to a single parquet file:
csv_to_parquet(
  path_to_file = parquetize_example("region_2022.txt"),
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Conversion from a local csv file to a single parquet file, selecting only
# a few columns:
csv_to_parquet(
  path_to_file = parquetize_example("region_2022.csv"),
  path_to_parquet = tempfile(fileext = ".parquet"),
  columns = c("REG", "LIBELLE")
)

# Conversion from a local csv file to a partitioned parquet file:
csv_to_parquet(
  path_to_file = parquetize_example("region_2022.csv"),
  path_to_parquet = tempfile(fileext = ".parquet"),
  partition = "yes",
  partitioning = c("REG")
)

# Conversion from a URL and a zipped file (csv):
csv_to_parquet(
  path_to_file = "https://www.nomisweb.co.uk/output/census/2021/census2021-ts007.zip",
  filename_in_zip = "census2021-ts007-ctry.csv",
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Conversion from a URL and a zipped file (txt):
csv_to_parquet(
  path_to_file = "https://sourceforge.net/projects/irisdss/files/latest/download",
  filename_in_zip = "IRIS TEST data.txt",
  path_to_parquet = tempfile(fileext = ".parquet")
)

## Not run:
# Conversion from a URL and a csv file with "gzip" compression:
csv_to_parquet(
  path_to_file = "https://github.com/sidsriv/Introduction-to-Data-Science-in-python/raw/master/census.csv",
  path_to_parquet = tempfile(fileext = ".parquet"),
  compression = "gzip",
  compression_level = 5
)
## End(Not run)
This function allows you to convert the result of a SQL query from a DBI connection to parquet format.
It handles all DBI-supported databases.
Two conversion possibilities are offered:
Convert to a single parquet file. Argument path_to_parquet must then be used;
Convert to a partitioned parquet file. Additional arguments partition and partitioning must then be used.
The examples explain how to convert a query to a chunked dataset.
dbi_to_parquet(
  conn,
  sql_query,
  path_to_parquet,
  max_memory,
  max_rows,
  chunk_memory_sample_lines = 10000,
  partition = "no",
  compression = "snappy",
  compression_level = NULL,
  ...
)
conn |
A DBIConnection object, as returned by DBI::dbConnect. |
sql_query |
a character string containing an SQL query (this argument is passed to DBI::dbSendQuery) |
path_to_parquet |
String that indicates the path to the directory where the parquet files will be stored. |
max_memory |
Memory size (in Mb) in which data of one parquet file should roughly fit. |
max_rows |
Number of lines that defines the size of the chunk. This argument cannot be used together with max_memory. |
chunk_memory_sample_lines |
Number of lines to read to estimate max_memory. Defaults to 10 000. |
partition |
String ("yes" or "no", "no" by default) that indicates whether you want to create a partitioned parquet file.
If "yes", the partitioning argument must also be filled in. |
compression |
compression algorithm. Default "snappy". |
compression_level |
compression level. Meaning depends on compression algorithm. |
... |
additional format-specific arguments, see arrow::write_parquet() and arrow::write_dataset() for more information. |
A parquet file, invisibly
# Conversion from a sqlite dbi connection to a single parquet file:
dbi_connection <- DBI::dbConnect(RSQLite::SQLite(),
  system.file("extdata", "iris.sqlite", package = "parquetize"))

# Reading iris table from local sqlite database
# and conversion to one parquet file:
dbi_to_parquet(
  conn = dbi_connection,
  sql_query = "SELECT * FROM iris",
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Reading iris table from local sqlite database by chunk (using
# `max_memory` argument) and conversion to multiple parquet files
dbi_to_parquet(
  conn = dbi_connection,
  sql_query = "SELECT * FROM iris",
  path_to_parquet = tempdir(),
  max_memory = 2 / 1024
)

# Using chunk and partition together is not possible directly but easy to do:
# Reading iris table from local sqlite database by chunk (using
# `max_memory` argument) and conversion to arrow dataset partitioned by
# species

# get unique values of column "Species" from table "iris"
partitions <- get_partitions(dbi_connection, table = "iris", column = "Species")

# loop over those values
for (species in partitions) {
  dbi_to_parquet(
    conn = dbi_connection,
    # use glue_sql to create the query filtering the partition
    sql_query = glue::glue_sql("SELECT * FROM iris where Species = {species}",
                               .con = dbi_connection),
    # add the partition name in the output dir to respect parquet partition schema
    path_to_parquet = file.path(tempdir(), "iris", paste0("Species=", species)),
    max_memory = 2 / 1024
  )
}

# If you need a more complicated query to get your partitions, you can use
# dbGetQuery directly:
col_to_partition <- DBI::dbGetQuery(dbi_connection,
  "SELECT distinct(`Species`) FROM `iris`")[, 1]
This function will download the file if it is remote and
unzip it if it is zipped. If it is neither, it simply returns the input path argument.
If the zip contains multiple files, you can use filename_in_zip
to set the file you want to unzip and use.
You can pipe its output into all *_to_parquet functions, as sketched below.
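A minimal piping sketch using the bundled example file (the native |> pipe requires R >= 4.1; with older versions you could use magrittr's %>% instead):

parquetize_example("region_2022.csv") |>
  download_extract() |>
  csv_to_parquet(path_to_parquet = tempfile(fileext = ".parquet"))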
download_extract(path, filename_in_zip)
path |
the input file's path or url. |
filename_in_zip |
Name of the csv file in the zip. Required if several csv files are included in the zip. |
the path to the usable (uncompressed) file, invisibly.
# 1. unzip a local zip file
# 2. parquetize it
file_path <- download_extract(system.file("extdata", "mtcars.csv.zip", package = "readr"))
csv_to_parquet(
  file_path,
  path_to_parquet = tempfile(fileext = ".parquet")
)

# 1. download a remote file
# 2. extract the file census2021-ts007-ctry.csv
# 3. parquetize it
file_path <- download_extract(
  "https://www.nomisweb.co.uk/output/census/2021/census2021-ts007.zip",
  filename_in_zip = "census2021-ts007-ctry.csv"
)
csv_to_parquet(
  file_path,
  path_to_parquet = tempfile(fileext = ".parquet")
)

# the file is local and not zipped so:
# 1. parquetize it
file_path <- download_extract(parquetize_example("region_2022.csv"))
csv_to_parquet(
  file_path,
  path_to_parquet = tempfile(fileext = ".parquet")
)
This function allows you to convert a fst file to parquet format.
Two conversion possibilities are offered:
Convert to a single parquet file. Argument path_to_parquet must then be used;
Convert to a partitioned parquet file. Additional arguments partition and partitioning must then be used.
fst_to_parquet(
  path_to_file,
  path_to_parquet,
  partition = "no",
  compression = "snappy",
  compression_level = NULL,
  ...
)
path_to_file |
String that indicates the path to the input file (don't forget the extension). |
path_to_parquet |
String that indicates the path to the directory where the parquet files will be stored. |
partition |
String ("yes" or "no", "no" by default) that indicates whether you want to create a partitioned parquet file.
If "yes", the partitioning argument must also be filled in. |
compression |
compression algorithm. Default "snappy". |
compression_level |
compression level. Meaning depends on compression algorithm. |
... |
additional format-specific arguments, see arrow::write_parquet() and arrow::write_dataset() for more information. |
A parquet file, invisibly
# Conversion from a local fst file to a single parquet file:
fst_to_parquet(
  path_to_file = system.file("extdata", "iris.fst", package = "parquetize"),
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Conversion from a local fst file to a partitioned parquet file:
fst_to_parquet(
  path_to_file = system.file("extdata", "iris.fst", package = "parquetize"),
  path_to_parquet = tempfile(fileext = ".parquet"),
  partition = "yes",
  partitioning = c("Species")
)
One very important piece of parquet metadata is the row group size.
If its value is low (below 10 000), you should rebuild your parquet files.
A normal value is between 30 000 and 1 000 000.
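One possible way to rebuild such a file is sketched below with arrow; data.parquet is a hypothetical file with too-small row groups and the chunk_size value is only illustrative:

library(arrow)
tbl <- read_parquet("data.parquet")
# rewrite with roughly 100 000 rows per row group
write_parquet(tbl, "data_rebuilt.parquet", chunk_size = 100000)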
get_parquet_info(path)
get_parquet_info(path)
path |
Parquet file path or directory. If a directory is given, all parquet
files it contains are processed. |
a tibble with 5 columns:
path, file path
num_rows, number of rows
num_row_groups, number of row groups
num_columns, number of columns
row_group_size, mean row group size
If any column contains NA, the parquet file may be malformed.
get_parquet_info(system.file("extdata", "iris.parquet", package = "parquetize"))

get_parquet_info(system.file("extdata", "iris_dataset", package = "parquetize"))
This function allows you to extract unique values from a table's column to use as partitions.
Internally, this function runs "SELECT DISTINCT(mycolumn) FROM mytable;".
get_partitions(conn, table, column)
conn |
A DBIConnection object, as returned by DBI::dbConnect. |
table |
a DB table name |
column |
Name of a column of that table. |
a vector with unique values for the column of the table
dbi_connection <- DBI::dbConnect(RSQLite::SQLite(),
  system.file("extdata", "iris.sqlite", package = "parquetize"))

get_partitions(dbi_connection, "iris", "Species")
This function allows you to convert a json or ndjson file to parquet format.
Two conversion possibilities are offered:
Convert to a single parquet file. Argument path_to_parquet must then be used;
Convert to a partitioned parquet file. Additional arguments partition and partitioning must then be used.
json_to_parquet(
  path_to_file,
  path_to_parquet,
  format = "json",
  partition = "no",
  compression = "snappy",
  compression_level = NULL,
  ...
)
path_to_file |
String that indicates the path to the input file (don't forget the extension). |
path_to_parquet |
String that indicates the path to the directory where the parquet files will be stored. |
format |
string that indicates if the format is "json" (by default) or "ndjson" |
partition |
String ("yes" or "no", "no" by default) that indicates whether you want to create a partitioned parquet file.
If "yes", the partitioning argument must also be filled in. |
compression |
compression algorithm. Default "snappy". |
compression_level |
compression level. Meaning depends on compression algorithm. |
... |
additional format-specific arguments, see arrow::write_parquet() and arrow::write_dataset() for more information. |
A parquet file, invisibly
# Conversion from a local json file to a single parquet file:
json_to_parquet(
  path_to_file = system.file("extdata", "iris.json", package = "parquetize"),
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Conversion from a local ndjson file to a single parquet file:
json_to_parquet(
  path_to_file = system.file("extdata", "iris.ndjson", package = "parquetize"),
  path_to_parquet = tempfile(fileext = ".parquet"),
  format = "ndjson"
)
parquetize comes bundled with a number of sample files in its inst/extdata
directory. This function makes them easy to access.
parquetize_example(file = NULL)
file |
Name of the file or directory. If NULL, the example files are listed. |
A character string
parquetize_example()

parquetize_example("region_2022.csv")

parquetize_example("iris_dataset")
This function reads all parquet files in the folder argument that start
with output_name,
combines them using rbind and writes the result to a new parquet file.
It can also delete the initial files if the delete_initial_files
argument is TRUE.
Be careful: this function will not work if files with different structures
are present in the folder given with the folder argument;
a quick structure pre-check is sketched below.
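As a hedged pre-check (assuming a hypothetical "output" folder whose files start with "test_data", as in the example below), you can compare the column names of every candidate file before combining them:

files <- list.files("output", pattern = "^test_data", full.names = TRUE)
# collect each file's column names and make sure they all match
schemas <- lapply(files, function(f) arrow::open_dataset(f)$schema$names)
stopifnot(length(unique(schemas)) == 1)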
rbind_parquet(
  folder,
  output_name,
  delete_initial_files = TRUE,
  compression = "snappy",
  compression_level = NULL
)
folder |
the folder where the initial files are stored |
output_name |
name of the output parquet file |
delete_initial_files |
Boolean. Should the function delete the initial files? TRUE by default. |
compression |
compression algorithm. Default "snappy". |
compression_level |
compression level. Meaning depends on compression algorithm. |
Parquet files, invisibly
## Not run:
library(arrow)
if (!file.exists("output")) {
  dir.create("output")
}

file.create("output/test_data1-4.parquet")
write_parquet(data.frame(
  x = c("a", "b", "c"),
  y = c(1L, 2L, 3L)
), "output/test_data1-4.parquet")

file.create("output/test_data4-6.parquet")
write_parquet(data.frame(
  x = c("d", "e", "f"),
  y = c(4L, 5L, 6L)
), "output/test_data4-6.parquet")

test_data <- rbind_parquet(folder = "output",
                           output_name = "test_data",
                           delete_initial_files = FALSE)
## End(Not run)
This function allows you to convert a rds file to parquet format.
Two conversion possibilities are offered:
Convert to a single parquet file. Argument path_to_parquet must then be used;
Convert to a partitioned parquet file. Additional arguments partition and partitioning must then be used.
rds_to_parquet(
  path_to_file,
  path_to_parquet,
  partition = "no",
  compression = "snappy",
  compression_level = NULL,
  ...
)
path_to_file |
String that indicates the path to the input file (don't forget the extension). |
path_to_parquet |
String that indicates the path to the directory where the parquet files will be stored. |
partition |
String ("yes" or "no", "no" by default) that indicates whether you want to create a partitioned parquet file.
If "yes", the partitioning argument must also be filled in. |
compression |
compression algorithm. Default "snappy". |
compression_level |
compression level. Meaning depends on compression algorithm. |
... |
additional format-specific arguments, see arrow::write_parquet() and arrow::write_dataset() for more information. |
A parquet file, invisibly
# Conversion from a local rds file to a single parquet file:
rds_to_parquet(
  path_to_file = system.file("extdata", "iris.rds", package = "parquetize"),
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Conversion from a local rds file to a partitioned parquet file:
rds_to_parquet(
  path_to_file = system.file("extdata", "iris.rds", package = "parquetize"),
  path_to_parquet = tempfile(fileext = ".parquet"),
  partition = "yes",
  partitioning = c("Species")
)
This function allows you to convert a table from a sqlite file to parquet format.
The following extensions are supported:
"db", "sdb", "sqlite", "db3", "s3db", "sqlite3", "sl3", "db2", "s2db", "sqlite2", "sl2".
Two conversion possibilities are offered:
Convert to a single parquet file. Argument path_to_parquet must then be used;
Convert to a partitioned parquet file. Additional arguments partition and partitioning must then be used.
sqlite_to_parquet(
  path_to_file,
  table_in_sqlite,
  path_to_parquet,
  partition = "no",
  compression = "snappy",
  compression_level = NULL,
  ...
)
path_to_file |
String that indicates the path to the input file (don't forget the extension). |
table_in_sqlite |
string that indicates the name of the table to convert in the sqlite file |
path_to_parquet |
String that indicates the path to the directory where the parquet files will be stored. |
partition |
String ("yes" or "no", "no" by default) that indicates whether you want to create a partitioned parquet file.
If "yes", the partitioning argument must also be filled in. |
compression |
compression algorithm. Default "snappy". |
compression_level |
compression level. Meaning depends on compression algorithm. |
... |
additional format-specific arguments, see arrow::write_parquet() and arrow::write_dataset() for more information. |
A parquet file, invisibly
# Conversion from a local sqlite file to a single parquet file:
sqlite_to_parquet(
  path_to_file = system.file("extdata", "iris.sqlite", package = "parquetize"),
  table_in_sqlite = "iris",
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Conversion from a local sqlite file to a partitioned parquet file:
sqlite_to_parquet(
  path_to_file = system.file("extdata", "iris.sqlite", package = "parquetize"),
  table_in_sqlite = "iris",
  path_to_parquet = tempfile(),
  partition = "yes",
  partitioning = c("Species")
)
This function allows you to convert an input file to parquet format.
It handles SAS, SPSS and Stata files in the same function: there is only one function to use for these 3 cases.
For these 3 cases, the function guesses the data format from the extension of the input file (in the path_to_file argument).
Two conversion possibilities are offered:
Convert to a single parquet file. Argument path_to_parquet must then be used;
Convert to a partitioned parquet file. Additional arguments partition and partitioning must then be used.
To avoid exhausting R's RAM, the conversion can be done by chunk. One of the arguments max_memory
or max_rows must then be used.
This is very useful for huge tables and for computers with little RAM, because the conversion is then done
with less memory consumption. For more information, see the package documentation.
table_to_parquet(
  path_to_file,
  path_to_parquet,
  max_memory = NULL,
  max_rows = NULL,
  chunk_size = lifecycle::deprecated(),
  chunk_memory_size = lifecycle::deprecated(),
  columns = "all",
  by_chunk = lifecycle::deprecated(),
  skip = 0,
  partition = "no",
  encoding = NULL,
  chunk_memory_sample_lines = 10000,
  compression = "snappy",
  compression_level = NULL,
  user_na = FALSE,
  ...
)
path_to_file |
String that indicates the path to the input file (don't forget the extension). |
path_to_parquet |
String that indicates the path to the directory where the parquet files will be stored. |
max_memory |
Memory size (in Mb) in which data of one parquet file should roughly fit. |
max_rows |
Number of lines that defines the size of the chunk. This argument cannot be used together with max_memory. |
chunk_size |
DEPRECATED, use max_rows instead |
chunk_memory_size |
DEPRECATED, use max_memory instead |
columns |
Character vector of columns to select from the input file (by default, all columns are selected). |
by_chunk |
DEPRECATED use max_memory or max_rows instead |
skip |
By default 0. Number of input lines to skip before reading data. |
partition |
String ("yes" or "no", "no" by default) that indicates whether you want to create a partitioned parquet file.
If "yes", the partitioning argument must also be filled in. |
encoding |
String that indicates the character encoding for the input file. |
chunk_memory_sample_lines |
Number of lines to read to estimate max_memory. Defaults to 10 000. |
compression |
compression algorithm. Default "snappy". |
compression_level |
compression level. Meaning depends on compression algorithm. |
user_na |
If TRUE, variables with user-defined missing values are read into haven::labelled_spss() objects; otherwise user-defined missing values are converted to NA. |
... |
Additional format-specific arguments, see arrow::write_parquet() and arrow::write_dataset() for more information. |
Parquet files, invisibly
# Conversion from a SAS file to a single parquet file:
table_to_parquet(
  path_to_file = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Conversion from a SPSS file to a single parquet file:
table_to_parquet(
  path_to_file = system.file("examples", "iris.sav", package = "haven"),
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Conversion from a Stata file to a single parquet file without progress bar:
table_to_parquet(
  path_to_file = system.file("examples", "iris.dta", package = "haven"),
  path_to_parquet = tempfile(fileext = ".parquet")
)

# Reading SPSS file by chunk (using `max_rows` argument)
# and conversion to multiple parquet files:
table_to_parquet(
  path_to_file = system.file("examples", "iris.sav", package = "haven"),
  path_to_parquet = tempfile(),
  max_rows = 50
)

# Reading SPSS file by chunk (using `max_memory` argument)
# and conversion to multiple parquet files of 5 Kb when loaded (5 Mb / 1024)
# (in real files, you should use a bigger value that fits in memory, like 3000
# or 4000):
table_to_parquet(
  path_to_file = system.file("examples", "iris.sav", package = "haven"),
  path_to_parquet = tempfile(),
  max_memory = 5 / 1024
)

# Reading SAS file by chunk of 50 lines with encoding
# and conversion to multiple files:
table_to_parquet(
  path_to_file = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempfile(),
  max_rows = 50,
  encoding = "utf-8"
)

# Conversion from a SAS file to a single parquet file, selecting only
# a few columns:
table_to_parquet(
  path_to_file = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempfile(fileext = ".parquet"),
  columns = c("Species", "Petal_Length")
)

# Conversion from a SAS file to a partitioned parquet file:
table_to_parquet(
  path_to_file = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempfile(),
  partition = "yes",
  partitioning = c("Species")  # vector used as partition key
)

# Reading SAS file by chunk of 50 lines
# and conversion to multiple files with zstd, compression level 10
if (isTRUE(arrow::arrow_info()$capabilities[['zstd']])) {
  table_to_parquet(
    path_to_file = system.file("examples", "iris.sas7bdat", package = "haven"),
    path_to_parquet = tempfile(),
    max_rows = 50,
    compression = "zstd",
    compression_level = 10
  )
}
Low-level function that implements the logic to write a parquet file or a dataset from data.
write_parquet_at_once(
  data,
  path_to_parquet,
  partition = "no",
  compression = "snappy",
  compression_level = NULL,
  ...
)
data |
the data.frame/tibble to write |
path_to_parquet |
String that indicates the path to the directory where the output parquet file or dataset will be stored. |
partition |
String ("yes" or "no", "no" by default) that indicates whether you want to create a partitioned parquet file.
If "yes", the partitioning argument must also be filled in. |
compression |
compression algorithm. Default "snappy". |
compression_level |
compression level. Meaning depends on compression algorithm. |
... |
Additional format-specific arguments, see arrow::write_parquet() |
A dataset, as returned by arrow::open_dataset.
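Because the return value is a dataset handle, you can inspect it right after writing; a small sketch (collect() is dplyr's generic, for which arrow provides methods):

ds <- write_parquet_at_once(iris, tempfile())
# the returned value is an arrow dataset: look at its schema
ds$schema
# or pull the written data back into R
head(dplyr::collect(ds))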
write_parquet_at_once(iris, tempfile())

write_parquet_at_once(iris, tempfile(), partition = "yes", partitioning = c("Species"))

## Not run:
write_parquet_at_once(iris, tempfile(), compression = "gzip", compression_level = 5)
## End(Not run)
Low-level function that implements the logic to read an input file by chunk and write a
dataset.
It will:
calculate the number of rows per chunk if needed;
loop over the input file by chunk;
write each output file.
write_parquet_by_chunk(
  read_method,
  input,
  path_to_parquet,
  max_rows = NULL,
  max_memory = NULL,
  chunk_memory_sample_lines = 10000,
  compression = "snappy",
  compression_level = NULL,
  ...
)
read_method |
A function used to read the input by chunk. It must accept three arguments: the input, skip (number of rows to skip) and n_max (maximum number of rows to return), as in the example below.
This method will be called until it returns a dataframe/tibble with zero rows. |
input |
The input to read. It can be anything you want, but most often a file path or a data.frame. |
path_to_parquet |
String that indicates the path to the directory where the output parquet file or dataset will be stored. |
max_rows |
Number of lines that defines the size of the chunk. This argument cannot be used together with max_memory. |
max_memory |
Memory size (in Mb) in which data of one parquet file should roughly fit. |
chunk_memory_sample_lines |
Number of lines to read to estimate max_memory. Defaults to 10 000. |
compression |
compression algorithm. Default "snappy". |
compression_level |
compression level. Meaning depends on compression algorithm. |
... |
Additional format-specific arguments, see arrow::write_parquet() |
A dataset, as returned by arrow::open_dataset.
# example with a dataframe

# we create the function to loop over the data.frame
read_method <- function(input, skip = 0L, n_max = Inf) {
  # if we are after the end of the input we return an empty data.frame
  if (skip + 1 > nrow(input)) { return(data.frame()) }

  # return the n_max rows from skip + 1
  input[(skip + 1):(min(skip + n_max, nrow(input))), ]
}

# we use it
write_parquet_by_chunk(
  read_method = read_method,
  input = mtcars,
  path_to_parquet = tempfile(),
  max_rows = 10
)

#
# Example with haven::read_sas
#

# we need to pass two arguments beside the 3 standard ones (input, skip and n_max).
# We will use a closure:
my_read_closure <- function(encoding, columns) {
  function(input, skip = 0L, n_max = Inf) {
    haven::read_sas(data_file = input,
                    n_max = n_max,
                    skip = skip,
                    encoding = encoding,
                    col_select = all_of(columns))
  }
}

# we initialize the closure
read_method <- my_read_closure(encoding = "WINDOWS-1252", columns = c("Species", "Petal_Width"))

# we use it
write_parquet_by_chunk(
  read_method = read_method,
  input = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempfile(),
  max_rows = 75
)