Thousands of Papers to Dataframe

Alfonso R. Reyes

2017-10-27

A single OnePetro query can return at most 1,000 rows, so the user can retrieve up to 1,000 papers per query. A request for more rows than that returns an error.

The OnePetro web page offers options to display 10, 50 or 100 rows at a time. Through scripts like these, that number can be raised to 1,000.

This article describes how to read multiple result pages, holding thousands of papers in total, into a single dataframe.
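
Because each request is capped at 1,000 rows, a result set of N papers needs ceiling(N / 1000) requests, each with its own `start` offset. The following is a minimal sketch of that arithmetic in base R, using the count of 3,076 conference papers reported later in this article. `page_plan` is a hypothetical helper written for illustration; it is not part of petro.One.

```r
# Hypothetical helper (not part of petro.One): compute the `start` offset
# and the `rows` value for each request needed to cover `total` results.
page_plan <- function(total, page_size = 1000) {
  stopifnot(total >= 0, page_size > 0)
  if (total == 0) {
    return(data.frame(start = integer(0), rows = integer(0)))
  }
  start <- seq(0, total - 1, by = page_size)   # 0, 1000, 2000, ...
  rows  <- pmin(page_size, total - start)      # last page holds the remainder
  data.frame(start = start, rows = rows)
}

plan <- page_plan(3076)
plan
##   start rows
## 1     0 1000
## 2  1000 1000
## 3  2000 1000
## 4  3000   76
```

The last request only has 76 papers left to return, which is why any `rows` value of 76 or more works for the final page.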

Retrieve the most numerous paper type

library(petro.One)

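What types of papers do we have?

# my_url is not defined in the original listing; this definition is an
# assumption: the same query used below, without a document type filter
my_url <- make_search_url(query = "pressure transient analysis",
                          how = "all")
papers_by_type(my_url)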
## # A tibble: 7 x 2
##               name value
##              <chr> <dbl>
## 1          Chapter     1
## 2 Conference paper  3076
## 3          General    60
## 4    Journal paper   895
## 5            Media     5
## 6            Other     1
## 7     Presentation     7

For the time being, we will retrieve only conference papers.

Collect first 1000 rows

# we use "conference-paper" only, because other document types have
# a different dataframe structure

my_url_1 <- make_search_url(query = "pressure transient analysis", 
                          how = "all", 
                          dc_type = "conference-paper",
                          start = 0,
                          rows  = 1000)

get_papers_count(my_url_1)
## [1] 3076
page_1 <- read_onepetro(my_url_1)
htm_1 <- "pta-01-conference.html"
xml2::write_html(page_1, file = htm_1)
onepetro_page_to_dataframe(htm_1)
## # A tibble: 1,000 x 6
##                                                         title_data
##                                                              <chr>
##  1                             Pressure Transient Analysis in SAGD
##  2                           Well-head Pressure Transient Analysis
##  3     Automated Pressure Transient Analysis with Smart Technology
##  4  Pressure Transient Analysis in Multilayered Faulted Reservoirs
##  5  Pressure Transient Analysis of Multifractured Horizontal Wells
##  6 Integrating Pressure Transient Analysis in Hydraulic Fracturing
##  7         Software Showcase: Pressure Transient Analysis Programs
##  8             Numerical Solutions for Pressure Transient Analysis
##  9        How Wellbore Dynamics Affect Pressure Transient Analysis
## 10                Pressure-Transient Analysis for Perforated Wells
## # ... with 990 more rows, and 5 more variables: paper_id <chr>,
## #   source <chr>, type <chr>, year <int>, author1_data <chr>

Collect second set of 1000 rows

my_url_2 <- make_search_url(query = "pressure transient analysis", 
                          how = "all", 
                          dc_type = "conference-paper",
                          start = 1000,
                          rows  = 1000)

page_2 <- read_onepetro(my_url_2)
htm_2 <- "pta-02-conference.html"
xml2::write_html(page_2, file = htm_2)
onepetro_page_to_dataframe(htm_2)
## # A tibble: 1,000 x 6
##                                                                     title_data
##                                                                          <chr>
##  1 Geophysical Monitoring of the Multilayer Reservoir with of Flooding and Ind
##  2 Geophysical Monitoring of the Multilayer Reservoir with of Flooding and Ind
##  3                             Low Salinity Flooding Trial at West Salym Field
##  4 A Novel Approach for Production Transient Analysis of Shale Gas/Oil Reservo
##  5 Physics-Based Approach for Shale Gas Numerical Simulation: Quintuple Porosi
##  6 Application of Multi-Level and High-Resolution Fracture Modeling in Field-S
##  7          The Value of Transient Temperature Responses in Testing Operations
##  8 An Improved Boundary Element Method for Modeling Fluid Flow through Fractur
##  9 A Semi-Analytical Model for Extended-Reach Wells with Wellbore Flow Splitti
## 10    Modeling and Interpretation of the Bottomhole Temperature Transient Data
## # ... with 990 more rows, and 5 more variables: paper_id <chr>,
## #   source <chr>, type <chr>, year <int>, author1_data <chr>

Collect next set of 1000 rows

my_url_3 <- make_search_url(query = "pressure transient analysis", 
                          how = "all", 
                          dc_type = "conference-paper",
                          start = 2000,
                          rows  = 1000)

page_3 <- read_onepetro(my_url_3)
htm_3 <- "pta-03-conference.html"
xml2::write_html(page_3, file = htm_3)
onepetro_page_to_dataframe(htm_3)
## # A tibble: 1,000 x 6
##                                                                     title_data
##                                                                          <chr>
##  1 Application of a Systematic Technique for the Dynamic Characterization of C
##  2 Establishing Key Reservoir Parameters With Diagnostic Fracture Injection Te
##  3 Successful Exploitation of Heterogeneous Unconsolidated Clastic Gas Reservo
##  4 New Structural Evolution Model for the North Kuwait Carbonate Fields and it
##  5 Modeling and History Matching of Hydrocarbon Production from Marcellus Shal
##  6 Improved Permeability Prediction in Heterogeneous Carbonate Reservoirs: A n
##  7               The Emerging Unconventional Upper Jurassic Oil Play in Mexico
##  8 Why Double Porosity Models Are Not Applicable To Simulating The Gas Condens
##  9 Why Dual Porosity Models are not Applicable for Simulation of the Near-Well
## 10 Heat Transfer Ahead of a SAGD Steam Chamber: A Study of Thermocouple Data F
## # ... with 990 more rows, and 5 more variables: paper_id <chr>,
## #   source <chr>, type <chr>, year <int>, author1_data <chr>

Collect remaining set

my_url_4 <- make_search_url(query = "pressure transient analysis", 
                          how = "all", 
                          dc_type = "conference-paper",
                          start = 3000,
                          rows  = 100)

page_4 <- read_onepetro(my_url_4)
htm_4 <- "pta-04-conference.html"
xml2::write_html(page_4, file = htm_4)
onepetro_page_to_dataframe(htm_4)
## # A tibble: 76 x 6
##                                                                     title_data
##                                                                          <chr>
##  1      Transient Analysis of Tight Gas Well Performance - More Case Histories
##  2 Pressure Transient and Decline Curve Behaviors in Naturally Fractured Vuggy
##  3 The Significance of Non-Darcy and Multiphase Flow Effects in High-Rate, Fra
##  4      Effect Of Drainage Area Shapes On The Productivity Of Horizontal Wells
##  5 A Predictive Model for Analyzing Erosional Velocity and Corrosion Effects E
##  6 A Step by Step Approach to Hydraulic Fracture Treatment Design, Implementat
##  7 Approximate Analytical Solutions for the Pressure Response at a Water Injec
##  8 Fluid Flow in a Fractured Reservoir Using a Geomechanically-Constrained Fau
##  9 Evaluating Barnett Shale Production Performance-Using an Integrated Approac
## 10 A Semi-Analytic (p/z) Rate-Time Relation for the Analysis and Prediction of
## # ... with 66 more rows, and 5 more variables: paper_id <chr>,
## #   source <chr>, type <chr>, year <int>, author1_data <chr>

Binding tables in one dataframe

p1 <- onepetro_page_to_dataframe(htm_1)
p2 <- onepetro_page_to_dataframe(htm_2)
p3 <- onepetro_page_to_dataframe(htm_3)
p4 <- onepetro_page_to_dataframe(htm_4)

papers <- rbind(p1, p2, p3, p4)
papers
## # A tibble: 3,076 x 6
##                                                         title_data
##                                                              <chr>
##  1                             Pressure Transient Analysis in SAGD
##  2                           Well-head Pressure Transient Analysis
##  3     Automated Pressure Transient Analysis with Smart Technology
##  4  Pressure Transient Analysis in Multilayered Faulted Reservoirs
##  5  Pressure Transient Analysis of Multifractured Horizontal Wells
##  6 Integrating Pressure Transient Analysis in Hydraulic Fracturing
##  7         Software Showcase: Pressure Transient Analysis Programs
##  8             Numerical Solutions for Pressure Transient Analysis
##  9        How Wellbore Dynamics Affect Pressure Transient Analysis
## 10                Pressure-Transient Analysis for Perforated Wells
## # ... with 3,066 more rows, and 5 more variables: paper_id <chr>,
## #   source <chr>, type <chr>, year <int>, author1_data <chr>
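
The four requests above follow the same pattern, so they could also be generated in a loop and bound in one step. The sketch below shows that pattern with base R only. `fetch_page` is a hypothetical stand-in that builds dummy rows locally so the example runs offline; with petro.One it would presumably wrap `make_search_url()`, `read_onepetro()` and `onepetro_page_to_dataframe()` as used above.

```r
# Hypothetical stand-in for one OnePetro request: returns a dataframe of
# dummy rows for the given offset. Replace the body with real download code.
fetch_page <- function(start, rows) {
  ids <- seq(start + 1, start + rows)
  data.frame(paper_id   = paste0("ID-", ids),
             title_data = paste("Dummy title", ids),
             stringsAsFactors = FALSE)
}

total     <- 3076     # count reported by get_papers_count() above
page_size <- 1000     # maximum rows per request
starts    <- seq(0, total - 1, by = page_size)

# one dataframe per page; the last page holds only the remainder
pages <- lapply(starts, function(s) {
  fetch_page(start = s, rows = min(page_size, total - s))
})

# bind all pages at once instead of naming p1, p2, p3, p4
papers_demo <- do.call(rbind, pages)
nrow(papers_demo)
## [1] 3076
```

Keeping the pages in a list means the binding step stays the same whether the query returns four pages or forty.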

Find which papers have the search word in the title

pattern <- "pressure transient analysis"
rows <- grep(pattern = pattern, papers$title_data, ignore.case = TRUE)
papers[rows, ]
## # A tibble: 163 x 6
##                                                                 title_data
##                                                                      <chr>
##  1                                     Pressure Transient Analysis in SAGD
##  2                                   Well-head Pressure Transient Analysis
##  3             Automated Pressure Transient Analysis with Smart Technology
##  4          Pressure Transient Analysis in Multilayered Faulted Reservoirs
##  5          Pressure Transient Analysis of Multifractured Horizontal Wells
##  6         Integrating Pressure Transient Analysis in Hydraulic Fracturing
##  7                 Software Showcase: Pressure Transient Analysis Programs
##  8                     Numerical Solutions for Pressure Transient Analysis
##  9                How Wellbore Dynamics Affect Pressure Transient Analysis
## 10 Pressure Transient Analysis and Inflow Performance for Horizontal Wells
## # ... with 153 more rows, and 5 more variables: paper_id <chr>,
## #   source <chr>, type <chr>, year <int>, author1_data <chr>
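
Note that the literal pattern misses hyphenated spellings: "Pressure-Transient Analysis for Perforated Wells", row 10 of the combined table, does not appear in the filtered result. Below is a small sketch of a more tolerant pattern, applied to three titles copied from the output above. The character class `[- ]` is an illustrative choice and was not used in the original search.

```r
titles <- c("Pressure Transient Analysis in SAGD",
            "Pressure-Transient Analysis for Perforated Wells",
            "Low Salinity Flooding Trial at West Salym Field")

# literal phrase: matches only the space-separated spelling
strict <- grepl("pressure transient analysis", titles, ignore.case = TRUE)

# allow either a hyphen or a space between the first two words
loose  <- grepl("pressure[- ]transient analysis", titles, ignore.case = TRUE)

data.frame(title = titles, strict = strict, loose = loose)
```

With the looser pattern the hyphenated title is matched as well, while the unrelated title is still excluded.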
# remove files that were created
files <- c(htm_1, htm_2, htm_3, htm_4)
file.remove(files)
## [1] TRUE TRUE TRUE TRUE